I read Joel's article about character sets and so I'm taking his advice to use UTF-8 on my web page and in my database. What I can't understand is what to do with user input. As Joel says, "It does not make sense to have a string without knowing what encoding it uses." But how do I know what encoding the user input string uses? If I have
<input type="text" name="atextfield" >
on my page, how do I know what encoding I'm getting from the user? What if the user puts in some special ASCII symbol, like ♣ or ™ or something? Is there some way I can detect that user input gave me something unrecognized in UTF-8? Is there some standard for how to handle this sort of thing?
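Check the HTTP request headers to discover the character encoding. If the client declares one, it appears as the charset parameter of the Content-Type header.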
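As a rough sketch of what checking the headers could look like in PHP, the function below pulls the charset parameter out of a Content-Type header value. The helper name charsetFromContentType is hypothetical, not a built-in. Browsers commonly leave the charset off ordinary form submissions, so expect null much of the time.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value such as "text/plain; charset=ISO-8859-1".
// Returns null when the header carries no charset parameter.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// In a request handler the header value is usually exposed here;
// it is absent on the command line, hence the empty-string default.
$declared = charsetFromContentType($_SERVER['CONTENT_TYPE'] ?? '');
```

A null result means the client declared nothing, which is not the same as the input being UTF-8.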
If your web page uses UTF-8, the browser will convert the input to UTF-8 for you. So even the special characters you mention will be submitted as UTF-8.
However, you can never rule out a user with itchy fingers who switches the page encoding back to ISO-8859-*.
You can make use of mb_detect_encoding, but it is not 100% bulletproof.
/* Detect character encoding with current detect_order */
/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");
/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");
/* Use array to specify encoding_list */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
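Since the question asks how to tell whether the input is something unrecognized in UTF-8, validating is often more dependable than guessing. The following is a minimal sketch, assuming the mbstring extension is loaded. The helper name isValidUtf8 and the sample byte strings are made up for illustration.

```php
<?php
// Hypothetical helper: true when $input is a well-formed UTF-8 byte sequence.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

$clubs  = "\xE2\x99\xA3"; // the club symbol encoded as UTF-8
$latin1 = "\xE9";         // e-acute encoded as ISO-8859-1; not valid UTF-8 on its own

var_dump(isValidUtf8($clubs));  // bool(true)
var_dump(isValidUtf8($latin1)); // bool(false)

// Passing true as the third argument turns on strict mode, so candidate
// encodings in which the bytes are invalid are ruled out.
var_dump(mb_detect_encoding($latin1, ['UTF-8', 'ISO-8859-1'], true));
```

This only says whether the bytes are well-formed UTF-8. It cannot say which legacy encoding an invalid string was actually written in.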
Don't try to detect the encoding; convert all user-entered text to UTF-8 in your application. You can do everything possible on your side: configure your web server to send UTF-8 pages and UTF-8 headers, configure your application to handle all text as UTF-8, tweak your filesystem (if necessary) to handle text files as UTF-8, and configure your database. But you simply have no real control over the user's end. You can suggest the proper character encoding in your HTML forms, like the following, but it's not really enforceable on the user's end:
<form action="/index.php" method="post" accept-charset="UTF-8"></form>
Unless detecting the encoding of user input is the whole purpose of your application, it's a fool's errand to try. Assume the encoding is wrong and convert it to UTF-8 in your app, just as you should assume your user input is malicious and clean it up before you attempt to insert it into your database.
In most languages that have UTF-8 properly implemented, ASCII characters will survive conversion, so don't worry about that either.
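One possible reading of "convert it to UTF-8 in your app" is sketched below, assuming the mbstring extension is loaded. The helper name toUtf8 and the ISO-8859-1 fallback are assumptions for illustration, not something the advice above prescribes. The validity check is there because blindly converting bytes that are already UTF-8 would double-encode them.

```php
<?php
// Hypothetical helper: return $input as UTF-8.
// Assumption: anything that is not well-formed UTF-8 was sent as $fallback.
function toUtf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already well-formed UTF-8, which includes plain ASCII
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

var_dump(toUtf8("hello") === "hello");   // bool(true): ASCII passes through unchanged
var_dump(bin2hex(toUtf8("\xE9")));       // string(4) "c3a9": e-acute re-encoded as UTF-8
```

The fallback is still a guess. If your users are more likely to send some other legacy encoding, pass that as the second argument instead.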