Point, Charset, Match: Character Encoding in JavaScript

March 26, 2010

If you’re not familiar with the principles of character encoding, read the prerequisite Dive Into HTML 5 section on the subject.

When you see issues with Character Encoding, it’s traditionally in the form of text on your page that looks like this: in Firefox or in IE.

Usually, those characters mean that the character encoding used on the page is either ambiguous (not specified), or incorrect. We can use Firefox to determine that Character Encoding of a web page (Right Click and go to View Page Info; or use the “Character Encoding” entry in the View menu). Check to make sure that the encoding reported by Firefox is the same encoding used in your IDE. For example, Eclipse 3.5 has a “Set Encoding” option in the Edit menu.

The reason most English alphabetic and numeric characters are consistent independent of character encoding is due to consistency in the lower characters in each encoding. The characters making up the ASCII character set (0-127) are the same as the lowest 128 characters of ISO-8859-1, UTF-8, and others.

Managing your character encodings gets trickier as you add more architectural layers to your application. For example, character encodings may differ in your database, the properties files used to configure your application (java.util.Properties uses ISO-8859-1 by default), or maybe the XML or JSON file you’re loading from an external API.

Ever heard of HTML character entities? That’s the primary reason they exist — as a sort of encoding independent reference to a particular character. So, for example, the Œ character does not exist in the ISO-8859-1 character set. To display this character in a document with ISO-8859-1 encoding, use the equivalent HTML character entity: &OElig;. For an easier reference, check out this full table of HTML character entities. If using ISO-8859-1 for your HTML document, any entity above Unicode index 255 will need to be escaped. If you’re using UTF-8 encoding, HTML character entities shouldn’t be required.

Setting the Character Encoding

To specify the character encoding for any file, you can set a Content-Type header by configuring your web server or application. Apache lets you easily set different default character encodings for each individual file extension (.js for example). Using the Content-Type header is the most full proof and efficient 1 method to serve content.

But without access to the Apache configuration, how do we specify the character encoding?

For external JavaScript Files

In the HTML File

Just add the charset attribute. If not specified, the HTML document’s character encoding is used by default (specified in the Content-Type header or the appropriate meta tag, for example: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>).

<script type="text/javascript" src="script.js" charset="utf-8"></script>

In the JavaScript File

To my knowledge, there is no way for a JavaScript file to report its own character encoding. To me, this seems like an omission. Each individual document should be able to report its own character encoding without a header. CSS files can do it (@charset at-rule). HTML files can do it (<meta> tag). Why not JavaScript files?

For Dynamically Created Script Tags

var s = document.createElement('script');
s.src = 'script.js';
s.type = 'text/javascript';
s.charset = 'utf-8';

If you’re using jQuery’s Ajax functions to load external JavaScript files, perhaps you might be inclined to use the script dataType. jQuery even provides a scriptCharset option for wrapping the above method for changing the charset on a dynamic script tag. Be warned, the jQuery Ajax function uses two different methods to load external script files (as of version 1.4.2). If a same-domain request, it uses an XMLHttpRequest. If a cross-domain request, it uses a dynamic script tag. So the scriptCharset jQuery option only applies to cross-domain requests. We’ll need some other method to mitigate our character encoding issues (or just use dynamic script tags).

For XMLHttpRequest Objects

Our saving grace would be the overrideMimeType method, if it weren’t poetically unavailable in Internet Explorer. Using this method, we can override the mime type and character encoding.

var xhr = new XMLHttpRequest();
// Not available in Internet Explorer (up to version 8 at time of writing)
if (xhr.overrideMimeType) {
	xhr.overrideMimeType('application/x-javascript; charset=utf-8');
}

Portable non-ASCII JavaScript

The best way to make non-ASCII characters in JavaScript files portable is to escape the characters properly. If the character is destined for HTML, use an HTML character entity (if available, not all Unicode or ISO-8559-1 characters have entities). Or, escape the characters using the proper Latin or Unicode escape sequence.

// Raw characters
var string = "ñó";

// HTML character entities
var string = "&ntilde;&oacute;";

// Escaped to Latin
var string = "xf1xf3";

// Escaped to Unicode
var string = "u00f1u00f3";

If you use the Google Closure Compiler, you’ll get the Unicode escape sequences for free (see issues 24 and 68). Make sure to read the tickets for more benefits of serving your JavaScript files using Unicode escape sequences to output only ASCII characters.

Summary

The easiest way to preemptively solve a lot of character encoding issues is to use UTF-8 for everything, and configure your web server/application to serve the UTF-8 Content-Type header. If you’re writing JavaScript code that you’re going to distribute to the masses, convert any non-ASCII characters using the proper escape sequences. Your JavaScript will be more portable, and will work out of the box on more server configurations.

Sources

Performance Implications of Charset, an article by Kyle Scholz

Zach Leatherman is a builder for the web at Font Awesome and the creator/maintainer of Eleventy (11ty), an award-winning open source site generator. At one point he became entirely too fixated on web fonts. He has given 85 talks in nine different countries at events like Beyond Tellerrand, Smashing Conference, Jamstack Conf, CSSConf, and The White House. Formerly part of CloudCannon, Netlify, Filament Group, NEJS CONF, and NebraskaJS. Learn more about Zach »