XHTML

XHTML combines the content semantics of HTML with the strict syntax of XML, which makes it a good fit when you want to automate the processing of information that is ultimately destined for human consumption.

Sample XHTML Document

A minimal XHTML document:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<html
   lang="en"
   xml:lang="en"
   xmlns="http://www.w3.org/1999/xhtml"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.w3.org/1999/xhtml
     http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
  <head>
    <meta
       http-equiv="Content-Type"
       content="text/html;charset=UTF-8" />
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
  </body>
</html>

This is handy as a starting point when creating new documents.
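
One quick way to check that a new document is still well-formed XHTML is to run it through an XML parser. The following is a minimal sketch using Python's standard library; the trimmed-down document string and the names `XHTML_NS`, `DOCUMENT` and `title` are illustrative assumptions, not part of the sample above.

```python
import xml.etree.ElementTree as ET

# Namespace every XHTML element lives in.
XHTML_NS = "http://www.w3.org/1999/xhtml"
# Namespace that the reserved xml: prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# A trimmed-down starting document (illustrative, not the full sample).
DOCUMENT = """<?xml version="1.0" encoding="utf-8" standalone="no"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Hello World!</title>
  </head>
  <body>
    <h1>Hello World!</h1>
  </body>
</html>
"""

# fromstring raises ET.ParseError if the document is not well-formed.
root = ET.fromstring(DOCUMENT.encode("utf-8"))

# ElementTree spells qualified names as "{namespace}localname".
is_xhtml_root = root.tag == f"{{{XHTML_NS}}}html"
language = root.get(f"{{{XML_NS}}}lang")
title = root.find("x:head/x:title", {"x": XHTML_NS}).text
```

This only checks well-formedness and a few structural facts; it does not validate against the XHTML schema or DTD.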

XML and HTML

HTML is great for structuring content: it has headings, tables, ordered and unordered lists, and maps (definition lists, dl), which is all you need to make information accessible to human beings.

XML is great for automated processing: it has very strict rules, and extensive tools for checking, formatting, querying and transforming information.

Character References

Not all characters are readily available on all keyboards. Also, once a character is entered into a text, it's not easy to see exactly which one it is - which is important if you're trying to search for it, e.g. using grep or the text editor search function.

It is surprisingly difficult to get this right in every possible context, and quite easy to miss that some context is not doing what you expect. Browsers, parsers and humans all need to make sense of entities, and each of them needs the right input to understand what is going on. Examples: if you give an XML parser an XHTML document with a &Prime; in it, the parser will complain (unless it reads the XHTML DTD, the entity is undefined). If you write "b<a" when you really mean "b is less than a", humans will understand, but parsers will complain.
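
Both failure cases can be reproduced with a parser that does not load any DTD. The sketch below uses Python's `xml.etree.ElementTree` for that purpose; the helper name `parses_as_xml` and the sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(source: str) -> bool:
    """Return True if `source` is well-formed for a parser that loads no DTD."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# An HTML-only named entity: undefined as far as plain XML is concerned.
named_entity = "<p>5&Prime; wide</p>"
# The same character written as a numeric character reference.
numeric_reference = "<p>5&#x2033; wide</p>"
# A bare less-than sign that a human reads as "b is less than a".
bare_less_than = "<p>b<a</p>"
# The same text with the less-than sign escaped.
escaped_less_than = "<p>b&lt;a</p>"

results = {
    "named_entity": parses_as_xml(named_entity),
    "numeric_reference": parses_as_xml(numeric_reference),
    "bare_less_than": parses_as_xml(bare_less_than),
    "escaped_less_than": parses_as_xml(escaped_less_than),
}
```

In this setup only the numeric reference and the escaped less-than sign are accepted; a browser reading the same fragments as HTML is considerably more forgiving, which is exactly why the problem is easy to overlook.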

A character reference is really just another way to type a character, so to a parser there is no difference between the reference and the character itself. This means care should be taken when, for example, formatting a source document: the formatter might replace the references with the characters themselves, which is a pain if the document needs to be edited manually (and that's kind of what "source" means).
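
A small round trip illustrates the point. This is a sketch that uses Python's `xml.etree.ElementTree` as a stand-in for "some formatter"; real formatting tools differ in what they emit, and the sample text is made up.

```python
import xml.etree.ElementTree as ET

# The same word, once typed with references and once with the characters.
from_reference = ET.fromstring("<p>sm&#x00F8;rrebr&#x00F8;d</p>")
from_character = ET.fromstring("<p>smørrebrød</p>")

# After parsing, the two are indistinguishable.
same_text = from_reference.text == from_character.text

# Writing the tree back out does not preserve how the source was typed.
# As a Unicode string, the references come back as literal characters.
as_unicode = ET.tostring(from_reference, encoding="unicode")
# With the default ASCII output, they come back as decimal references,
# not the hexadecimal ones that were in the source.
as_ascii = ET.tostring(from_reference)
```

Either way the hand-written `&#x00F8;` is gone after the round trip, so a formatting step belongs on a copy of the source rather than on the source itself.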

The table below shows how to reference special characters safely, without confusing humans or parsers. Notice that XML only provides references for characters that might be mistaken for markup, while HTML provides a wide range of references for characters that might not be available on every keyboard (e.g. Danes abroad hardly ever find "ø" in an easy-to-reach place).
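Name	Symbol	XML Reference	HTML Reference	Unicode (UCS) Reference
Ampersand	&	&amp;	&amp;	&#x0026;
Double Quotation Mark	"	&quot;	&quot;	&#x0022;
Apostrophe	'	&apos;	&apos;	&#x0027;
Less-than	<	&lt;	&lt;	&#x003C;
Greater-than	>	&gt;	&gt;	&#x003E;
Prime (minutes/feet)	′		&prime;	&#x2032;
Double prime (seconds/inches)	″		&Prime;	&#x2033;
Tilde	∼		&sim;	&#x223C;
Three Dots (Horizontal Ellipsis)	…		&hellip;	&#x2026;
Degrees	°		&deg;	&#x00B0;
Ligature ae	Æ / æ		&AElig; / &aelig;	&#x00C6; / &#x00E6;
O-slash (oe)	Ø / ø		&Oslash; / &oslash;	&#x00D8; / &#x00F8;
A-ring (aa)	Å / å		&Aring; / &aring;	&#x00C5; / &#x00E5;
1/2 (Half Fraction)	½		&frac12;	&#x00BD;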
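
Rows like these do not have to be looked up by hand. As a rough sketch, the numeric reference can be derived from the character's code point, and the HTML name can be looked up in Python's `html.entities.name2codepoint` table (which covers the HTML 4 entity set); the function names `numeric_reference` and `html_reference` are made up for this example.

```python
from html.entities import name2codepoint
from typing import Optional


def numeric_reference(char: str) -> str:
    """Return the hexadecimal character reference for a single character."""
    return f"&#x{ord(char):04X};"


def html_reference(char: str) -> Optional[str]:
    """Return the HTML 4 named reference for a character, or None if it has none."""
    code_point = ord(char)
    for name, value in name2codepoint.items():
        if value == code_point:
            return f"&{name};"
    return None


half = (html_reference("½"), numeric_reference("½"))
double_prime = (html_reference("″"), numeric_reference("″"))
o_slash = (html_reference("ø"), numeric_reference("ø"))
```

The numeric form works for any character, whereas the named form only exists for characters that the HTML entity set happens to cover.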

Named Entities

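HTML and XML both define names for some characters, mostly the ones that are likely to clash with reserved characters such as <, >, " or ' and &. Unfortunately, very few of these named entities are consistent between HTML and XML: XML itself predefines only &amp;, &lt;, &gt;, &quot; and &apos;, while HTML defines a much larger set that an XML parser does not know unless it reads the DTD. So if you want to read an XHTML document as HTML or as XML depending on the context, named entities are not the way to go; numeric character references work in both.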
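
If a document already contains HTML-only named entities, one option is to rewrite them as numeric references. The following is a minimal sketch, assuming the markup contains no CDATA sections, comments or scripts in which an entity-like string should be left alone; the names `XML_PREDEFINED` and `to_numeric_references` are hypothetical.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &Prime; or &frac12;.
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Replace HTML-only named entities with hexadecimal character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Leave the XML entities and any unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#x{name2codepoint[name]:04X};"

    return NAMED_ENTITY.sub(replace, markup)


converted = to_numeric_references("<p>5&Prime; &amp; 3&frac12; &unknown;</p>")
```

Here `converted` keeps `&amp;` and the unrecognised `&unknown;` as they are, and turns `&Prime;` and `&frac12;` into `&#x2033;` and `&#x00BD;`.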

Percent Encoding in URIs

When URLs referenced in an HTML document contain reserved characters, care must be taken so that they keep their intended meaning, both from the point of view of the document (which, for example, assigns a special meaning to the & character) and from the point of view of a browser trying to retrieve a document via a link (for example, when a user clicks on a link in an HTML document).
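
The two layers can be handled one after the other: first percent-encode the data that goes into the URL, then escape the finished URL for the document. The sketch below uses Python's standard library; the base URL `http://example.com/search` and the parameter names are placeholders, not real endpoints.

```python
from html import escape
from urllib.parse import quote, urlencode

# Hypothetical base URL and query values, chosen to contain reserved characters.
base = "http://example.com/search"
query = urlencode({"q": "b<a & c", "lang": "en"})

# Layer 1: the URL as the browser should request it.
# Inside the value, "&" and "<" are percent-encoded; the "&" that
# separates the two parameters stays as it is.
url = f"{base}?{query}"

# Layer 2: the same URL as it must appear inside an (X)HTML attribute.
# The separating "&" becomes "&amp;" so the parser does not see an entity.
href = escape(url, quote=True)
link = f'<a href="{href}">Search</a>'

# Non-ASCII path segments are percent-encoded as UTF-8 bytes.
path_segment = quote("smørrebrød")
```

When the document is parsed, `&amp;` turns back into `&`, so the browser ends up requesting exactly the string in `url`.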