Point, Charset, Match: Character Encoding in JavaScript
If you’re not familiar with the principles of character encoding, read the prerequisite Dive Into HTML 5 section on the subject.
When you see issues with Character Encoding, it’s traditionally in the form of text on your page that looks like this: in Firefox or in IE.
Usually, those characters mean that the character encoding used on the page is either ambiguous (not specified), or incorrect. We can use Firefox to determine that Character Encoding of a web page (Right Click and go to View Page Info; or use the “Character Encoding” entry in the View menu). Check to make sure that the encoding reported by Firefox is the same encoding used in your IDE. For example, Eclipse 3.5 has a “Set Encoding” option in the Edit menu.
The reason most English alphabetic and numeric characters are consistent independent of character encoding is due to consistency in the lower characters in each encoding. The characters making up the ASCII character set (0-127) are the same as the lowest 128 characters of ISO-8859-1, UTF-8, and others.
Managing your character encodings gets trickier as you add more architectural layers to your application. For example, character encodings may differ in your database, the properties files used to configure your application (java.util.Properties uses ISO-8859-1 by default), or maybe the XML or JSON file you’re loading from an external API.
Ever heard of HTML character entities? That’s the primary reason they exist — as a sort of encoding independent reference to a particular character. So, for example, the Œ character does not exist in the ISO-8859-1 character set. To display this character in a document with ISO-8859-1 encoding, use the equivalent HTML character entity: Œ
. For an easier reference, check out this full table of HTML character entities. If using ISO-8859-1 for your HTML document, any entity above Unicode index 255 will need to be escaped. If you’re using UTF-8 encoding, HTML character entities shouldn’t be required.
Setting the Character Encoding
To specify the character encoding for any file, you can set a Content-Type
header by configuring your web server or application. Apache lets you easily set different default character encodings for each individual file extension (.js
for example). Using the Content-Type
header is the most full proof and efficient 1 method to serve content.
But without access to the Apache configuration, how do we specify the character encoding?
For external JavaScript Files
In the HTML File
Just add the charset
attribute. If not specified, the HTML document’s character encoding is used by default (specified in the Content-Type
header or the appropriate meta tag, for example: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
).
<script type="text/javascript" src="script.js" charset="utf-8"></script>
In the JavaScript File
To my knowledge, there is no way for a JavaScript file to report its own character encoding. To me, this seems like an omission. Each individual document should be able to report its own character encoding without a header. CSS files can do it (@charset
at-rule). HTML files can do it (<meta>
tag). Why not JavaScript files?
For Dynamically Created Script Tags
var s = document.createElement('script');
s.src = 'script.js';
s.type = 'text/javascript';
s.charset = 'utf-8';
If you’re using jQuery’s Ajax functions to load external JavaScript files, perhaps you might be inclined to use the script dataType
. jQuery even provides a scriptCharset
option for wrapping the above method for changing the charset on a dynamic script tag. Be warned, the jQuery Ajax function uses two different methods to load external script files (as of version 1.4.2). If a same-domain request, it uses an XMLHttpRequest
. If a cross-domain request, it uses a dynamic script
tag. So the scriptCharset
jQuery option only applies to cross-domain requests. We’ll need some other method to mitigate our character encoding issues (or just use dynamic script tags).
For XMLHttpRequest Objects
Our saving grace would be the overrideMimeType
method, if it weren’t poetically unavailable in Internet Explorer. Using this method, we can override the mime type and character encoding.
var xhr = new XMLHttpRequest();
// Not available in Internet Explorer (up to version 8 at time of writing)
if (xhr.overrideMimeType) {
xhr.overrideMimeType('application/x-javascript; charset=utf-8');
}
Portable non-ASCII JavaScript
The best way to make non-ASCII characters in JavaScript files portable is to escape the characters properly. If the character is destined for HTML, use an HTML character entity (if available, not all Unicode or ISO-8559-1 characters have entities). Or, escape the characters using the proper Latin or Unicode escape sequence.
// Raw characters
var string = "ñó";
// HTML character entities
var string = "ñó";
// Escaped to Latin
var string = "xf1xf3";
// Escaped to Unicode
var string = "u00f1u00f3";
If you use the Google Closure Compiler, you’ll get the Unicode escape sequences for free (see issues 24 and 68). Make sure to read the tickets for more benefits of serving your JavaScript files using Unicode escape sequences to output only ASCII characters.
Summary
The easiest way to preemptively solve a lot of character encoding issues is to use UTF-8 for everything, and configure your web server/application to serve the UTF-8 Content-Type
header. If you’re writing JavaScript code that you’re going to distribute to the masses, convert any non-ASCII characters using the proper escape sequences. Your JavaScript will be more portable, and will work out of the box on more server configurations.
1 Comment
Shirley Mckeon Disqus
21 Jul 2011