XHTML
Combining the content-semantics of HTML with the structure of XML, XHTML is great if you want to automate processing of information which is destined for human consumption.
Sample XHTML Document
The simplest possible XHTML document:
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          lang="en" xml:lang="en"
          xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
      <head>
        <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
        <title>Hello World!</title>
      </head>
      <body>
        <h1>Hello World!</h1>
      </body>
    </html>
This is handy as a starting point when creating new documents.
XML and HTML
HTML is great for structuring content: it has headers, tables, ordered and unordered lists, and maps (definition lists, dl). That is all you need to make information accessible to human beings.
XML is great for automated processing: it has very strict rules, and extensive tools for checking, formatting, querying and transforming information.
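As a small illustration of that automated processing, the sketch below reads an XHTML document with Python's standard `xml.etree.ElementTree` and pulls out the text of selected elements. The helper name `extract_text` and the `SAMPLE` string are made up for this example, and the sketch assumes the document declares the XHTML namespace on its root element.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace; elements are looked up as {namespace}tag.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample input, assumed to declare the XHTML namespace.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head><title>Hello World!</title></head>
  <body><h1>Hello World!</h1></body>
</html>"""


def extract_text(xhtml_text, tag):
    """Return the text content of every <tag> element in the document."""
    root = ET.fromstring(xhtml_text)
    return ["".join(el.itertext()) for el in root.iter(f"{{{XHTML_NS}}}{tag}")]


print(extract_text(SAMPLE, "title"))  # ['Hello World!']
print(extract_text(SAMPLE, "h1"))     # ['Hello World!']
```

The same document remains readable by a person in a browser, which is the point of combining the two formats.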
Character References
Not all characters are readily available on all keyboards. Also, once a character is entered into a text, it's not easy to see exactly which one it is - which is important if you're trying to search for it, e.g. using grep or the text editor search function.
It is surprisingly difficult to get this right for all possible contexts. Furthermore, it is quite easy to not realize that some context is not doing what's expected. Browsers, parsers and humans all need to make sense of entities, and each of them needs the right input to understand what is going on. Examples: If you give an XML parser an XHTML document with the named entity `&Prime;` (″) in it, the parser will complain. If you write "b<a" when you really mean "b is less than a", humans will understand, but parsers will complain.
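Both examples can be tried with an XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for illustration, and other parsers may word their complaints differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if an XML parser accepts the fragment, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# A named HTML entity that plain XML does not define: rejected.
print(is_well_formed("<p>6&Prime;</p>"))              # False
# A bare less-than sign in text: rejected.
print(is_well_formed("<p>b<a</p>"))                   # False
# Numeric reference plus a predefined XML entity: accepted.
print(is_well_formed("<p>6&#x2033; and b&lt;a</p>"))  # True
```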
Since the character reference is really just another way to type a character, to a parser, there is no difference between the reference and the character itself. This means care should be taken when e.g. formatting a source document - the formatter might replace the entities, which is a pain if the document needs to be edited manually (and that's kind of what "source" means).
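To see why a formatter can silently change the source, here is a sketch of a parse-and-serialize round trip with Python's `xml.etree.ElementTree`. The function name `roundtrip` is an assumption for this example; real formatters differ in detail, but the effect is similar.

```python
import xml.etree.ElementTree as ET


def roundtrip(source, encoding):
    """Parse a fragment and serialize it again, as a formatter would."""
    return ET.tostring(ET.fromstring(source), encoding=encoding)


by_reference = ET.fromstring("<p>6&#x2033;</p>")
by_character = ET.fromstring("<p>6\u2033</p>")
# After parsing, the reference and the literal character are the same text.
print(by_reference.text == by_character.text)  # True

# Serializing does not restore the original spelling of the reference.
print(roundtrip("<p>6&#x2033;</p>", "unicode"))   # <p>6″</p>
print(roundtrip("<p>6&#x2033;</p>", "us-ascii"))  # b'<p>6&#8243;</p>'
```

The hexadecimal reference comes back either as the literal character or as a decimal reference, depending on the output encoding, so the hand-edited form is lost either way.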
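Name | Symbol | XML Reference | HTML Reference | Unicode (UCS) Reference |
---|---|---|---|---|
Ampersand | & | `&amp;` | `&amp;` | `&#x26;` |
Double Quotation Mark | " | `&quot;` | `&quot;` | `&#x22;` |
Apostrophe | ' | `&apos;` | `&apos;` (HTML5; not in HTML 4) | `&#x27;` |
Less-than | < | `&lt;` | `&lt;` | `&#x3C;` |
Greater-than | > | `&gt;` | `&gt;` | `&#x3E;` |
Prime (minutes/feet) | ′ | `&#8242;` | `&prime;` | `&#x2032;` |
Double prime (seconds/inches) | ″ | `&#8243;` | `&Prime;` | `&#x2033;` |
Tilde | ∼ | `&#8764;` | `&sim;` | `&#x223C;` |
Three Dots (Horizontal Ellipsis) | … | `&#8230;` | `&hellip;` | `&#x2026;` |
Degrees | ° | `&#176;` | `&deg;` | `&#xB0;` |
Ligature ae | Æ / æ | `&#198;` / `&#230;` | `&AElig;` / `&aelig;` | `&#xC6;` / `&#xE6;` |
O-slash (oe) | Ø / ø | `&#216;` / `&#248;` | `&Oslash;` / `&oslash;` | `&#xD8;` / `&#xF8;` |
A-ring (aa) | Å / å | `&#197;` / `&#229;` | `&Aring;` / `&aring;` | `&#xC5;` / `&#xE5;` |
1/2 (Half Fraction) | ½ | `&#189;` | `&frac12;` | `&#xBD;` |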
Named Entities
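HTML and XML both define names for some characters, mostly the ones which are likely to clash with reserved characters such as <, >, ", ' and &. Unfortunately, very few of these named entities are consistent between HTML and XML: XML predefines only five (`&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`), while HTML defines many more that an XML parser does not know unless a DTD declares them. So if you want to read an XHTML document as HTML and as XML depending on the context, named entities are not the way to go; numeric character references work in both.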
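One way to sidestep the mismatch is to rewrite HTML-only named entities as numeric references before handing a document to an XML parser. The sketch below does this with the HTML 4 entity table in Python's standard `html.entities` module; the function name `named_to_numeric` is made up for this example, and unknown names are deliberately left untouched rather than guessed.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser knows; these are left as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named references such as &Prime; (but not numeric ones like &#189;).
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace HTML 4 named entities with hexadecimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as-is
        return f"&#x{name2codepoint[name]:X};"
    return _ENTITY.sub(replace, text)


print(named_to_numeric("6&Prime; &amp; &frac12;"))  # 6&#x2033; &amp; &#xBD;
```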
Links
There is no shortage of information about XHTML:
- Wikipedia on XHTML
- The XHTML 1 Specification from the W3C
- w3schools on HTML and XHTML
- XML and HTML entity references
Percent Encoding in URIs
When URLs referenced in an HTML document contain reserved characters, care must be taken so they take on the proper meaning, both from the point of view of the document (which e.g. might assign a special meaning to the & character), and from the point of view of a browser trying to retrieve a document via a link (e.g. when a user clicks on a link in an HTML document).
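The two layers can be handled one after the other: first percent-encode the URL so the browser requests the right resource, then escape the result for the document. The following is a rough sketch with Python's standard library; `build_href` is a hypothetical helper and `http://example.com/search` is a placeholder address, not something taken from this page.

```python
from html import escape
from urllib.parse import urlencode


def build_href(base, params):
    """Build a URL that is safe to place in an XHTML href attribute."""
    # Step 1: percent-encode reserved characters inside each parameter value
    # and join the parameters with "&".
    url = f"{base}?{urlencode(params)}"
    # Step 2: escape the "&" separators (and any quotes) for the document.
    return escape(url, quote=True)


href = build_href("http://example.com/search", {"q": "a&b", "lang": "en"})
print(href)  # http://example.com/search?q=a%26b&amp;lang=en
```

The `&` inside the value becomes `%26` for the browser, while the `&` that separates parameters becomes `&amp;` for the document parser; the parser turns it back into a plain `&` before the browser follows the link.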