Everything You Always Wanted To Know About Character Sets In Webdevelopment * But Were Afraid To Ask

Saturday 24 January 2009 21:07

By Patrick van Bergen

Paraphrasing Woody Allen's 1972 movie has been done ad nauseam, I know, but this title just popped into my head and wouldn't leave :)

I will try to be as brief as possible. This article is about character sets in web development. I will first explain character sets in general and then go into the specifics for PHP, MySQL, HTML, HTTP, XML, CSS, Javascript, and Ajax.

Character sets need-to-know


For those of you who read this article with no knowledge of the matter, I will start by explaining what character sets are. A character set is a set of characters, like the letters "A", "B", "C" etc. Computers store these letters as numbers. The mapping from characters to numbers is called the "character encoding".

Traditionally the most important character set was ASCII, which maps "A", "B", and "C" to 65, 66, and 67. Its characters and control codes are mapped to the numbers 0 - 127. This suited everyone just fine, as long as "everyone" was limited to the English-speaking countries. It excluded characters such as ä, ß, and ç. Microsoft, always a forerunner in the field of internationalization, coined various "code pages" to deal with this issue, each code page suiting a different group of languages. Of these, Windows-1252 has survived and is still very much alive to this day. It contains the characters of most Western European languages and still manages to fit all encoded numbers into a single byte (numbers 0 - 255).
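To make the mapping concrete, here is a minimal sketch in Python (chosen only for illustration; the article itself does not use it). It prints the ASCII numbers of a few letters and shows that Windows-1252, which Python calls "cp1252", stores each of the accented examples in one byte.

```python
# Characters are stored as numbers; an encoding maps one to the other.
ascii_codes = [ord(ch) for ch in "ABC"]
print(ascii_codes)

# ASCII only covers 0-127, so "ä" cannot be encoded with it.
try:
    "ä".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)

# Windows-1252 fits each of these characters into a single byte.
western = "äßç"
encoded = western.encode("cp1252")
print(list(encoded))
```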

Windows-1252 (a.k.a. CP1252) is actually based on ISO 8859-1 (better known as Latin-1), but it contains some additions. For Europeans the most prominent character missing from Latin-1 is the euro sign. To fix this, ISO 8859-15 was later introduced as a revision of ISO 8859-1. These three character sets are often confused.
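The difference between the three sets is easiest to see with the euro sign. The sketch below uses Python's codec names ("cp1252", "latin-1", "iso-8859-15") purely as an illustration.

```python
EURO = "\u20ac"  # the euro sign

# Windows-1252 stores the euro sign as byte 0x80.
euro_cp1252 = EURO.encode("cp1252")

# ISO 8859-15 stores it as byte 0xA4.
euro_latin9 = EURO.encode("iso-8859-15")

# ISO 8859-1 (Latin-1) has no euro sign at all.
try:
    EURO.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# The same byte therefore means different things in different sets:
# 0xA4 is the generic currency sign in Latin-1, the euro in ISO 8859-15.
print(euro_cp1252, euro_latin9, latin1_has_euro)
print(b"\xa4".decode("latin-1"), b"\xa4".decode("iso-8859-15"))
```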

 



Years later the Unicode standard was laid out to end all character set disputes. The standard is generally accepted and aims to comprise all characters known to man. They do not fit into a single byte any more. How many bytes are used depends on the character encoding you choose. Of these encodings, UTF-8 and UTF-16 are most commonly used. UTF-8 uses 1 to 4 bytes per character and UTF-16 uses 2 or 4. UTF-8 is also ASCII-compatible, which means that any ASCII string is also a valid Unicode string encoded in UTF-8.
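As a small illustration (a Python sketch, with sample characters picked by me), the following counts the bytes each encoding needs for four characters and checks the ASCII compatibility of UTF-8.

```python
# One sample character for each UTF-8 length: 1, 2, 3 and 4 bytes.
samples = ["A", "ä", "€", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, without a byte order mark
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")

# An ASCII string has exactly the same bytes when encoded as UTF-8.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)
```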

UCS (Universal Character Set), a.k.a. ISO 10646 was developed in parallel to Unicode. The character sets have converged over time. The character encodings of UCS and Unicode may now be considered identical.

Character sets in web development languages

 

HTML (starting with version 4.0)


HTML uses UCS as its document character set. The HTML 4 specification does not prescribe a default encoding, though. Most web developers don't know this, and you don't really need to care as long as your audience consists of Western countries and you "encode" any non-ASCII characters as character entities, like &amp;, &euro;, and &reg;. In practice, most browsers take Windows-1252 as the default encoding when none is specified.

If you need to use many non-ASCII characters and you find that the entities make your source code impossible to read, you can write the characters directly and declare the encoding of the page by inserting this element in its head section, using UTF-8 as an example:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

It may strike you as odd that the encoding of the document is declared inside the document itself. There is a chicken-and-egg problem here: the browser must read the document to find the encoding it needs to read the document. In ASCII-compatible encodings this works because the meta element itself consists of ASCII characters only. The problem is most notable if every character takes at least two bytes (as it does in UTF-16). For this reason, UTF-16 and UTF-32 documents need to start with a Byte Order Mark (BOM), which the browser can easily detect.
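How a reader of the document might detect a BOM can be sketched as follows. This is an illustration in Python, not a description of what any particular browser does; the function name sniff_bom is made up for this example, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Check the UTF-32 marks before the UTF-16 ones: the UTF-32 LE mark
# starts with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding announced by a byte order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

page = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
print(sniff_bom(page))       # utf-16-le
print(sniff_bom(b"<html>"))  # None: no BOM, fall back to other hints
```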

HTTP


An HTML page is transmitted from the web server to the web client (browser) via HTTP. The web server (Apache, for instance) sends some meta information about the file it serves in the headers of the HTTP response. Along with the MIME type, it may send information about the character set:

Content-Type: text/html; charset=utf-8

This HTTP character set overrides the encoding declared in the document itself, so make sure it is correct. For example, Apache's AddDefaultCharset directive, when set to On, adds charset=ISO-8859-1 to the header, which would garble the non-ASCII characters of your UTF-8 documents.
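The precedence rule can be sketched like this. The example is in Python and the function effective_charset is a hypothetical helper written for this illustration; it reuses the header parser from Python's standard email package to read the charset parameter.

```python
from email.message import Message

def effective_charset(http_content_type, meta_charset):
    """Pick the charset a client would use: the HTTP header wins,
    the in-document declaration is only a fallback."""
    if http_content_type:
        msg = Message()
        msg["Content-Type"] = http_content_type
        header_charset = msg.get_content_charset()  # lower-cased, or None
        if header_charset:
            return header_charset
    return meta_charset

# The server announces ISO-8859-1 although the page declares UTF-8.
print(effective_charset("text/html; charset=ISO-8859-1", "utf-8"))
# No charset in the header: the declaration in the page is used.
print(effective_charset("text/html", "utf-8"))
```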

XML


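XML documents are built from UCS characters. You specify the encoding in the XML declaration, which must be the very first thing in the document. If you leave it out, the parser presumes UTF-8 (or UTF-16 if the document starts with a byte order mark):

<?xml version="1.0" encoding="windows-1252"?>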
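A quick way to see the declaration at work is to hand raw bytes to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the price element is made up for the example.

```python
import xml.etree.ElementTree as ET

document = '<?xml version="1.0" encoding="windows-1252"?><price>€ 10</price>'

# Encode the text as announced: the euro sign becomes the single byte 0x80.
raw = document.encode("cp1252")

# The parser receives bytes and reads the encoding from the declaration.
root = ET.fromstring(raw)
print(root.text)
```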

PHP


Strings in PHP (up to version 5) do not have a character set; a string is simply a series of bytes. This is planned to change in version 6, in which strings are to be Unicode, using UTF-16 as the encoding. Since a byte is sufficient to store a Windows-1252 character and this happens to be the default character set of both MySQL and most browsers, it may be tempting to think that PHP uses Windows-1252 as well. However, once you start using the function "utf8_encode" to convert your string to UTF-8 you'll be surprised that the euro sign is not converted properly. utf8_encode presumes that the input string is in Latin-1 (ISO 8859-1), which has no euro sign. Another function that performs character set conversion is iconv, but its behaviour depends on the underlying library, so test it on your platform. To convert Windows-1252 to UTF-8, use:

$newString = iconv('CP1252', 'UTF-8', $oldString);
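The euro sign problem is easy to reproduce outside PHP. The following Python sketch is only an analogy: decoding with "latin-1" stands in for the assumption that utf8_encode makes according to the text above, and decoding with "cp1252" stands in for the iconv call.

```python
cp1252_bytes = b"\x80 5"  # the text "€ 5" as stored in Windows-1252

# Wrong assumption: treat the bytes as Latin-1 before converting.
# Byte 0x80 becomes the control character U+0080, not the euro sign.
wrong = cp1252_bytes.decode("latin-1").encode("utf-8")

# Right assumption: treat the bytes as Windows-1252.
right = cp1252_bytes.decode("cp1252").encode("utf-8")

print(wrong)  # b'\xc2\x80 5'
print(right)  # b'\xe2\x82\xac 5'
```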

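All standard string functions, such as strlen and substr, presume that every character fits in a single byte. On a UTF-8 string they therefore count and cut bytes, not characters.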
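What byte-oriented string handling does to multi-byte text can be sketched as follows. The example is in Python, where a bytes object behaves much like the byte strings described above; the sample text is mine.

```python
text = "naïve €"
as_bytes = text.encode("utf-8")

# Counting bytes overstates the length: "ï" takes 2 bytes, "€" takes 3.
print(len(text), len(as_bytes))

# Cutting at a byte offset can split a character in half.
first_three_bytes = as_bytes[:3]  # "na" plus the first byte of "ï"
try:
    first_three_bytes.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print(cut_is_valid)
```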

Java / C#


Strings in Java and C#, as in most modern languages, are Unicode strings. Internally they are encoded in UTF-16.
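One consequence of UTF-16 is that a character outside the first 65,536 code points takes two 16-bit units. The sketch below counts those units in Python; utf16_units is a helper invented for this example, and its result corresponds to the length a UTF-16 based string type would report.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

for sample in ["A", "€", "\U0001F600"]:
    print(f"U+{ord(sample):04X} needs {utf16_units(sample)} unit(s)")
```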

Javascript


The official ECMAScript specification prescribes UTF-16 as the string encoding. However, the source text of a script in a browser inherits the encoding of the document it is in. So if the HTML page is encoded in Windows-1252, the browser will read the Javascript text as Windows-1252 as well.

Ajax


If you send data from the server to the client via the XMLHttpRequest object (or one of its Javascript-framework wrappers) you need to be careful about the encoding of your data. The default encoding the Ajax object expects happens to be UTF-8. If your current encoding on the server and the client is different from UTF-8, you can do two things: 1. change the expected encoding of the XMLHttpRequest object, or 2. encode your data with UTF-8 before sending it to the client.
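The second option, converting the data to UTF-8 on the server before sending it, could look like the sketch below. It is written in Python for illustration only; utf8_response is a hypothetical helper and the header values are my own choice, not something prescribed by the XMLHttpRequest object.

```python
def utf8_response(text: str):
    """Build headers and body for a response that is explicitly UTF-8."""
    body = text.encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body

# Data that lives on the server as Windows-1252 bytes.
legacy = b"caf\xe9 \x80 2"

# Decode with the encoding the data really has, then send it as UTF-8.
headers, body = utf8_response(legacy.decode("cp1252"))
print(headers)
print(body)
```

Stating the charset in the header as well means the client does not have to guess.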

CSS


CSS uses the UCS as a character set. If you need to make the encoding of your stylesheet explicit you can do so in the HTML link element that links your stylesheet document (the file name is just an example):

<link rel="stylesheet" type="text/css" href="style.css" charset="ISO-8859-1">

or place the specification at the very start of the stylesheet document itself, using the @charset rule:

@charset "ISO-8859-1";

MySQL


MySQL allows you to specify the character set per table, and even per field. If you don't specify the table's character set, MySQL will use the default character set of the database, which in turn defaults to the server's character set as it is defined in the MySQL option file (my.ini):

default-character-set=latin1

Of course it is better to state the character set explicitly. For example:

CREATE TABLE IF NOT EXISTS names (
    `id` INT NOT NULL auto_increment,
    `name` VARCHAR(255) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;


The example also specifies the default collation of the table. The collation determines, among other things, how character strings are compared and ordered alphabetically.

By "latin1", the default character set, MySQL actually means "Windows-1252". They do know the difference, but the name has stayed anyway.

Transport

By executing a query you are sending an SQL query string from your application to the MySQL server. In response, MySQL sends a result string. Both strings have encodings. Both encodings are determined by MySQL system variables: "character_set_client" for client-to-server and "character_set_results" for server-to-client. These variables have server-wide defaults, but they can also be set per connection, for example with the statement SET NAMES utf8. If you have no control over them, you must encode your query before sending it to the server and decode the results after receiving them.

Afterword


Character sets will pose problems for some time. Here are some quick hints to keep you out of trouble:

  • In web development there are only two main character sets ruling the stage: Windows-1252 and Unicode (encoded in UTF-8 or UTF-16).
  • Presume that any string of characters has a given character encoding.
  • Find out which application in your software environment uses which character set and which character encoding.
  • If a document is transported from one place to another, make sure you control the encoding used by the transport mechanism.
  • If you are totally losing track of all the layers and conversions your document must go through before reaching its destination, consider converting all non-ASCII characters to Unicode escapes, for example \u01ef.
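The last hint, falling back to Unicode escapes, can be sketched with JSON, which uses exactly this \uXXXX notation. The example is in Python and relies on json.dumps escaping all non-ASCII characters by default.

```python
import json

text = "caf\u00e9 \u01ef"

# By default json.dumps writes every non-ASCII character as a \uXXXX escape,
# so the result survives any ASCII-compatible transport unchanged.
escaped = json.dumps(text)
print(escaped)

# The receiving side gets the original characters back.
restored = json.loads(escaped)
print(restored == text)
```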

Which character set and encoding should you choose if you have the choice?

  • Unicode is the standard. Operating Systems have great support for Unicode.
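  • Algorithms for working with UTF-16 strings are generally faster than UTF-8 algorithms, because nearly all characters in common use take exactly two bytes. So use UTF-16 for storage.
  • For text that consists mostly of ASCII characters, UTF-8 strings are much shorter than UTF-16 strings. So use UTF-8 for transport.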
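How much shorter UTF-8 is depends on the text. The following Python sketch compares both encodings for two sample strings of my own choosing: UTF-8 wins clearly for mostly-ASCII text, while text in a script such as Japanese can come out longer in UTF-8.

```python
def sizes(text: str):
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "Character sets in web development"
japanese = "\u6587\u5b57\u30b3\u30fc\u30c9"  # five characters, "character code"

print(sizes(english))   # UTF-8 needs half the bytes of UTF-16 here
print(sizes(japanese))  # here UTF-8 needs 3 bytes per character, UTF-16 only 2
```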


Reactions on "Everything You Always Wanted To Know About Character Sets In Webdevelopment * But Were Afraid To Ask"

combatwombat
Placed on: 03-31-2009 09:58
Great article. I too have found the pain of Mysql encoding, but found how to clean up from the mess and convert to utf-8. Blogged here -- http://combatwombat.7dov...l-latin1-to-utf8-issues

Thanks for the UTF-8 vs 16 explanation too.
