Not definitive at all, I'm still confused and grasping for answers.
Some of this depends on access to your server configuration (or on having a nice host)
The presentation is much longer than it should've been, but it's now very late and I must sleep.
My output is mashed! What are all of these little boxes?
What inconsiderate person has chosen to use non-English text in my app!
I won't be looking at localisation (L10N) at all, but at a more basic level of character representation; this could help you more than it may sound like it will. Possibly.
A representation of a set of characters
Letters map to code points; representation of code points is left up to encodings.
... --- ...
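To make the code point / encoding split concrete, here is a minimal sketch. Python is used purely for illustration (it is not part of the stack discussed here), and the variable names are made up for the example. The same character has one code point but different byte representations depending on the encoding chosen.

```python
# One character, one code point, two different byte representations.
ch = "é"

code_point = ord(ch)                  # the Unicode code point (an integer)
latin1_bytes = ch.encode("latin-1")   # one byte in iso-8859-1
utf8_bytes = ch.encode("utf-8")       # two bytes in UTF-8

print(hex(code_point), latin1_bytes, utf8_bytes)
```

The code point never changes; only the bytes on disk or on the wire do.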
ASCII - represents every unaccented English letter, numeral, and some punctuation and control characters with a number between 0 and 127; encoded as a 7-bit binary number.
This encoding leaves a bit spare to be used in all sorts of snazzy ways...
iso-8859-1 (latin-1); Western European (accents etc.)
Windows CP1252; very similar to iso-8859-1, 27 differences to catch the unwary; problems pasting from Word into text areas? Mmmmm, dig the smart quotes.
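The smart-quote problem can be reproduced in a few lines. This is a sketch in Python (an illustration only; the function name is hypothetical) that compares how each byte value decodes under CP1252 and under iso-8859-1, skipping the handful of bytes that Python's CP1252 codec treats as unassigned.

```python
def cp1252_latin1_differences():
    """Return (byte, cp1252_char) pairs where CP1252 and latin-1 disagree."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        try:
            as_cp1252 = raw.decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is unassigned in CP1252
        if as_cp1252 != raw.decode("latin-1"):
            diffs.append((b, as_cp1252))
    return diffs


diffs = cp1252_latin1_differences()

# Byte 0x93 is Word's opening smart quote in CP1252,
# but an invisible control character in iso-8859-1.
smart_open = b"\x93".decode("cp1252")
as_latin1 = b"\x93".decode("latin-1")

print(len(diffs), repr(smart_open), repr(as_latin1))
```

All of the differences sit in the byte range 0x80 to 0x9F, which iso-8859-1 reserves for control characters.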
A single character set to include every possible character (6' x 12' posters available).
An encoding for storing Unicode (ie. every character) code points in memory with 8 bit bytes.
Code points 0 - 127 stored in a single byte; the same as ASCII; efficient for English text
Code points from 128 upwards are stored in 2-4 bytes (the original design allowed sequences of up to 6 bytes, but RFC 3629 limits UTF-8 to 4).
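A quick way to see the variable width is to encode a few characters and count the bytes. This is a Python sketch for illustration only; the helper name and the sample characters are arbitrary choices.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = ["A", "é", "€", "😀"]   # code points U+0041, U+00E9, U+20AC, U+1F600
lengths = {ch: utf8_length(ch) for ch in samples}

# Plain ASCII text is byte-for-byte identical in UTF-8.
ascii_same = "Hello".encode("utf-8") == "Hello".encode("ascii")

print(lengths, ascii_same)
```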
So, we'll use UTF-8 for everything, problem solved … doubles all round!
Be aware of the encodings that you're dealing with, specify UTF-8 to browsers and use their handling capabilities where possible.
Save your documents in utf-8 encoding.
For GNU Emacs:
;; utf-8 encoding as default
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(prefer-coding-system 'utf-8)
in your .emacs
Use a meta tag to stand in for the HTTP header (put it as early as possible in the head; if the browser has already guessed a different encoding it will back up on reaching the tag and re-scan from the start of the document):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Forms should use an accept-charset attribute:
<form accept-charset="utf-8">
This works as a hint to the browser to send us UTF-8 encoded data.
Not bullet-proof, especially for application/x-www-form-urlencoded data.
But hey, we do what we can …
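Because accept-charset is only a hint, one defensive option is to check on the server that what arrived really is valid UTF-8 before trusting it. The sketch below shows the idea in Python (illustration only; is_valid_utf8 is a hypothetical helper, not something from this stack).

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the raw bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


good = "café".encode("utf-8")     # what we asked the browser for
bad = "café".encode("latin-1")    # what a misbehaving client might send

print(is_valid_utf8(good), is_valid_utf8(bad))
```

Note that pure ASCII input passes this check whatever the client thought it was sending, since ASCII is valid UTF-8.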
Specify the encoding in your XML prolog:
<?xml version="1.0" encoding="utf-8"?>
UTF-8 is the default for documents that don't specify an encoding (and that don't have a BOM ... whole other story)
Default httpd.conf:
AddDefaultCharset On
Adds an iso-8859-1 charset header to all text/plain and text/html documents. Standard error pages are the exception; they are given a charset apt to their content.
This overrides the charset we specified in our HTML!
We need to specify:
AddDefaultCharset utf-8
eg (in the virtual host definition)
<VirtualHost *>
ServerName roxxor
DocumentRoot /home/jon/blah
AddDefaultCharset utf-8
php_value default_charset utf-8
php_value mbstring.internal_encoding utf-8
</VirtualHost>
PHP values explained later, this is a handy place to set them …
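The reason the server setting matters is precedence: a charset in the HTTP Content-Type header wins over one declared in a meta tag. A rough Python sketch of that rule follows (illustration only; both function names are hypothetical, and real browsers apply further sniffing steps that are not modelled here).

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()   # lower-cased, or None when absent


def effective_charset(http_header: str, meta_content: str) -> Optional[str]:
    """HTTP header charset takes priority; the meta tag is only a fallback."""
    return (charset_from_content_type(http_header)
            or charset_from_content_type(meta_content))


meta = "text/html; charset=utf-8"
print(effective_charset("text/html; charset=iso-8859-1", meta))
print(effective_charset("text/html", meta))
```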
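utf-8 is utf8 in MySQL. (MySQL's utf8 stores at most 3 bytes per character, so it cannot hold characters outside the Basic Multilingual Plane; MySQL 5.5.3 and later add utf8mb4 for full UTF-8.)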
Collation: sort order, may be case sensitive or not
To find out your current setup:
SHOW VARIABLES LIKE 'character_set_database';
SHOW VARIABLES LIKE 'character_set_client';
To see available character sets and collations
SHOW CHARACTER SET;
SHOW COLLATION LIKE 'utf8%';
We can set character set and collation per server, database, table and connection:
Server (/etc/my.cnf):
[mysqld]
...
# older servers called these default-character-set / default-collation
character-set-server=utf8
collation-server=utf8_general_ci
Database:
(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8
Table:
(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8
Connection:
SET NAMES 'utf8';
According to Bertrand Mansion this can be set on a server-wide basis in your my.cnf:
[mysqld]
init_connect="SET NAMES utf8"
however, init_connect is not executed for users with the SUPER privilege, so this will not apply if you connect as the MySQL root user.
PHP mysql connection (I think) defaults to a latin1 connection, so, first query after connection:
mysql_query("SET NAMES 'utf8'");
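What goes wrong without SET NAMES is easy to simulate: UTF-8 bytes get interpreted as latin1 somewhere along the connection. A small Python sketch of that mismatch (illustration only; no database is involved and the variable names are made up):

```python
original = "café"

# The application sends UTF-8 bytes...
wire_bytes = original.encode("utf-8")

# ...but the other end believes the connection is latin1.
misread = wire_bytes.decode("latin-1")

# While nothing has been lost yet, the damage can be undone
# by reversing the wrong step.
recovered = misread.encode("latin-1").decode("utf-8")

print(misread, recovered)
```

Each accented character turns into two junk characters, which is the classic symptom of a connection charset mismatch.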
CONVERT() function for converting between charsets, eg:
INSERT INTO utf8table (utf8column)
SELECT CONVERT(latin1field USING utf8)
FROM latin1table;
Consider field sizes - a character may take up to 3 bytes in MySQL's utf8, so byte-based limits (storage, index key lengths) need to allow for this.
Support is, ahem, a bit lacking at the moment; inbuilt expectation of latin-1.
UTF-8 characters are represented by 1-4 bytes, but some PHP functions don't know about this; they see
character == byte
However, all is not lost. And better Unicode support is promised for PHP5.
Need to set
default_charset = utf-8
Setting is PHP_INI_ALL, so ini_set()'ll do it. See above for apache setting.
For MySQL connections, as above, first query after connection:
mysql_query("SET NAMES 'utf8'");
PCRE, use u modifier for unicode
$pattern = "/pattern/u";
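The effect of the u modifier can be imitated by matching the same pattern against raw bytes and against decoded text. This Python sketch is only an analogy (Python's re module is not PCRE, and the names are made up): without Unicode awareness "." means one byte, with it "." means one character.

```python
import re

text = "é"
as_bytes = text.encode("utf-8")   # two bytes: 0xC3 0xA9

# Byte-oriented matching: "." consumes a single byte, so a
# two-byte character is not "exactly one character".
byte_match = re.fullmatch(rb".", as_bytes)

# Character-oriented matching: "." consumes the whole character.
char_match = re.fullmatch(r".", text)

print(byte_match is None, char_match is not None)
```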
When using htmlspecialchars, supply encoding as 3rd param:
htmlspecialchars($string, ENT_QUOTES, 'utf-8');
Not sure about the need for this …
String handling functions assume character == byte, so won't work correctly with multibyte encoded character sets.
Use the mbstring extension if available; it provides multibyte-aware equivalents (eg. mb_strlen) of the string functions (eg. strlen).
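The character == byte assumption is easiest to see by counting the same string both ways. In this Python sketch (illustration only), the byte count plays the part of a byte-oriented function like strlen and the character count plays the part of a multibyte-aware one like mb_strlen; the sample string is arbitrary.

```python
s = "naïve café"

char_len = len(s)                    # counts characters
byte_len = len(s.encode("utf-8"))    # counts bytes, as a byte-oriented function would

# Cutting at a byte offset can slice a character in half.
cut = s.encode("utf-8")[:3]          # 'n', 'a', and the first byte of 'ï'
try:
    cut.decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True

print(char_len, byte_len, broken)
```

The same off-by-some error affects substring, padding and wrapping operations when they work on bytes.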
If using mbstring, set internal_encoding:
mbstring.internal_encoding = utf-8
php.net says that the default for the internal encoding is null, but my setup seemed to default to iso-8859-1.
mbstring.func_overload setting looks good;
"Overloads a set of single byte functions by the mbstring counterparts."
Unfortunately stability may be an issue at present.
iconv extension for conversion between character sets.
Useful for conversion of existing data, or where you can't control the charset of input data (eg. feeds).
See also, mb_convert_encoding().
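The conversion step itself is just decode-then-encode. Below is a Python sketch of the idea (illustration only; both helper names are hypothetical, and the CP1252 fallback for feeds of unknown origin is an assumption for the example, not a recommendation from the source).

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def feed_to_utf8(data: bytes, fallback: str = "cp1252") -> bytes:
    """Keep valid UTF-8 as-is; otherwise assume the fallback encoding."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD rather than raising.
        return data.decode(fallback, errors="replace").encode("utf-8")


print(to_utf8(b"\x93hi\x94", "cp1252").decode("utf-8"))
print(feed_to_utf8(b"caf\xe9").decode("utf-8"))
```

Guessing an encoding is always a last resort; where the feed declares its charset, pass that to the conversion instead.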
Of course, all of this also applies to any libraries/programs that you're using... Aaarrgghh!
Only scratched the surface, but hopefully enough to give you some ideas/nightmares.