Everything You Always Wanted To Know About Character Sets In Webdevelopment * But Were Afraid To Ask
Paraphrasing Woody Allen's 1972 movie has been done ad nauseam, I know, but this title just popped into my head and wouldn't leave :)
Character sets need-to-know
For those of you who read this article with no knowledge of the matter, I will start by explaining what character sets are. A character set is a set of characters, like the letters "A", "B", "C" etc. Computers store these letters as numbers. The mapping from characters to numbers is called the "character encoding".
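This character-to-number mapping is easy to see in Python, for example, where ord and chr expose it directly:

```python
# ord() gives the number a character is mapped to; chr() goes the other way.
print(ord("A"))   # 65
print(ord("B"))   # 66
print(chr(67))    # C
```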
Traditionally the most important character set was ASCII, which maps "A", "B", and "C" to 65, 66, and 67. Its characters and special tokens occupy the numbers 0-127. This suited everyone just fine, as long as "everyone" was limited to the English-speaking countries. It excluded characters such as ä, ß, and ç. Microsoft, always a forerunner in the field of internationalization, coined various "code pages" to deal with this issue, each code page suiting a different language. Of these, Windows-1252 has survived and is still very much alive to this day. It contains all characters of the Western languages and still manages to fit every encoded number into a single byte (numbers 0-255).
Windows-1252 (a.k.a. CP1252) is actually based on ISO 8859-1 (better known as latin-1), but it contains some additions. For Europeans the most prominent character lacking from latin-1 is the euro sign. To remedy this, ISO 8859-1 was later extended to ISO 8859-15. These three character sets are often confused.
Years later the Unicode standard was laid out to end all character set disputes. The standard is generally accepted and comprises all characters known to man. These no longer fit into a single byte; how many bytes a character takes depends on the character encoding you choose. Of these encodings, UTF-8 and UTF-16 are the most commonly used. UTF-8 uses 1 to 4 bytes per character and is ASCII-compatible: any ASCII string is also a valid Unicode string encoded in UTF-8.
UCS (Universal Character Set), a.k.a. ISO 10646 was developed in parallel to Unicode. The character sets have converged over time. The character encodings of UCS and Unicode may now be considered identical.
Character sets in webdevelopment languages
HTML (starting with version 4.0)
HTML uses UCS as its character set, with UTF-8 as the default encoding. Most webdevelopers don't know this, and you don't really need to care as long as your audience is in Western countries and you "encode" any non-ASCII characters as character entities, like &amp;, &euro;, and &reg;. Furthermore, most browsers ignore the standard and take Windows-1252 as the default character set.
If you need to use many non-standard characters and you find that the entities make your source code impossible to read, you can declare the encoding explicitly by inserting this HTML element in the page, using UTF-8 as an example:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
It may strike you as odd that the encoding of the document is specified inside the document itself. There is a chicken-and-egg problem here: to read the declaration, the browser must already know the encoding. The problem is most noticeable when a character takes at least two bytes (as it does in UTF-16). For this reason, UTF-16 and UTF-32 documents need to start with a Byte Order Mark (BOM), which is easily detectable by the browser.
- HTML specification on character sets
- Character sets in HTML
- MSDN; character set recognition in Internet Explorer
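The BOM detection described above can be sketched in a few lines of Python. This is a simplified illustration, not what any particular browser does; real browsers also consult HTTP headers and use heuristics:

```python
import codecs

def sniff_bom(data):
    """Guess the encoding of a byte string from its BOM, if it has one."""
    # Order matters: the UTF-32 LE BOM starts with the same two bytes
    # (FF FE) as the UTF-16 LE BOM, so it must be checked first.
    for bom, encoding in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return encoding
    return None  # no BOM: fall back to headers, meta elements, or heuristics

print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))  # utf-16-be
```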
An HTML page is transmitted from the webserver to the webclient (browser) via the HTTP protocol. The webserver (Apache, for instance) sends some meta information about the file it serves in the header of the HTTP response. Along with the mimetype, it may send information about the character set:
Content-Type: text/html; charset=utf-8
This HTTP character set hint overrides the character set specification inside the document, so make sure it is correct. Apache's AddDefaultCharset directive, when simply switched on, sets the character set to ISO 8859-1, which would make your UTF-8 documents unreadable.
- AddDefaultCharset is bad, bad, bad
- w3.org: Character sets & encodings in XHTML, HTML and CSS
- Setting charset information in .htaccess
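As a sketch, a .htaccess fragment that makes Apache announce UTF-8 could look like this (assuming your documents really are UTF-8 and use the .html extension):

```apache
# Announce UTF-8 for responses that carry no explicit charset of their own.
AddDefaultCharset UTF-8
# Or, more targeted: announce UTF-8 for .html files only.
AddCharset UTF-8 .html
```

AddDefaultCharset applies to all text/plain and text/html responses, while AddCharset keys on the file extension, so you normally pick one of the two.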
XML
XML documents are built from UCS characters. You specify the encoding in the XML declaration at the top of the document:
<?xml version="1.0" encoding="windows-1252"?>
PHP
Strings in PHP (up to version 5) do not have a character set. This will change in version 6, when the character set will be set to Unicode, with UTF-16 as the encoding. For the moment a string is simply a series of bytes. Since a byte is sufficient to store a Windows-1252 character, and Windows-1252 happens to be the default character set of both MySQL and most browsers, it may be tempting to think that PHP uses Windows-1252 as well. However, once you use the function utf8_encode to convert your string to UTF-8, you will be surprised that the euro sign is not converted properly: utf8_encode presumes that the input string is in latin-1 (ISO 8859-1). Another function that performs character set conversion is iconv, but it is not very reliable. To convert Windows-1252 to UTF-8, use:
$newString = iconv('CP1252', 'UTF-8', $oldString);
Also note that all of PHP's string functions presume that a character fits in a single byte.
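For comparison, the same conversion can be sketched in Python, where the difference between latin-1 and Windows-1252 shows up clearly: byte 0x80 is the euro sign in Windows-1252, but a meaningless control character in latin-1, which is exactly why utf8_encode mangles it.

```python
old = b"price: \x80 5,-"                    # Windows-1252 bytes; 0x80 is "€"
new = old.decode("cp1252").encode("utf-8")  # what iconv('CP1252', 'UTF-8', ...) does
print(new)                                  # b'price: \xe2\x82\xac 5,-'

# Decoding the same byte as latin-1 (what utf8_encode presumes) does NOT
# yield the euro sign:
print(b"\x80".decode("latin-1") == "\u20ac")  # False
```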
Java / C#
Strings in Java and C#, like those in most modern languages, use Unicode as the character set. Internally, they are encoded in UTF-16.
CSS
CSS uses UCS as its character set. If you need to make the encoding of your stylesheet explicit, you can do so in the HTML link element that links your stylesheet document:
<link rel="stylesheet" type="text/css" href="style.css" charset="UTF-8">
or place the specification at the very top of the stylesheet document itself, using the @charset rule:
@charset "UTF-8";
MySQL
MySQL allows you to specify the character set per table, and even per field. If you don't specify the table's character set, MySQL will use the default character set as it is defined in the configuration file (my.ini), for example:
[mysqld]
default-character-set=latin1
Of course it is better to state the character set explicitly. For example:
CREATE TABLE IF NOT EXISTS `names` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
The example also specifies the collation of the field. Among other things, the collation determines how character strings will be ordered alphabetically.
By "latin-1", the default character set, MySQL actually means "Windows-1252". They do know the difference, but the name has stayed anyway.
When you execute a query, you send an SQL query string from your application to the MySQL server. In response, MySQL sends back a result string. Both strings have an encoding, and both encodings are determined by MySQL system variables: "character_set_client" for client-to-server and "character_set_results" for server-to-client. These variables have server-wide defaults but can be set per connection. If you have no control over them, you must encode your query accordingly before sending it, and decode the result after receiving it.
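If you do control the connection, MySQL's SET NAMES statement sets character_set_client, character_set_connection, and character_set_results for the current session in one go:

```sql
-- Tell the server that this client sends and expects UTF-8.
SET NAMES 'utf8';
```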
Character sets will pose problems for some time. Here are some quick hints to keep you out of trouble:
- In webdevelopment there are only two main character sets ruling the stage: Windows-1252 and Unicode (encoded in UTF-8 or UTF-16).
- Remember that every string of characters is stored in some character encoding, even when that encoding is not stated.
- Find out which application in your software environment uses which character set and which character encoding.
- If a document is transported from one place to another, make sure you control the encoding used by the transport mechanism.
- If you are totally losing sight of all the layers and conversions your document must go through before reaching its destination, consider converting all non-ASCII characters to Unicode escapes, for example \u01ef.
Which character set and encoding should you choose if you have the choice?
- Unicode is the standard. Operating Systems have great support for Unicode.
- Algorithms for working with UTF-16 strings are generally simpler and faster than UTF-8 algorithms, because most characters occupy a fixed two bytes. So use UTF-16 for in-memory storage.
- For mostly-ASCII text, UTF-8 strings are much shorter than UTF-16 strings. So use UTF-8 for transport.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) ~ Joel Spolsky