Unicode

Links

See also: ECMAScript Internationalization API
Unicode-aware regular expressions in ECMAScript 6
Unicode Supplementary Characters for ECMAScript (6+?)
- Unicode Code Point Escape Sequences
  - Syntax: \u{ plus one to six hexadecimal digits plus }. Each escape contributes one or two code units to the string, depending on whether the code point is inside or outside the BMP.
- Other Text Processing Functions
  - e.g. toLowerCase
- Regular Expressions
- Code Point Based String Accessors
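
A minimal sketch of the escape syntax and the code point based accessors listed above. It assumes an engine that implements them as they were later standardized in ES2015 (\u{...}, String.prototype.codePointAt, String.fromCodePoint, and string iteration by code point). The variable names are illustrative only.

```javascript
// Code point escape: one to six hex digits inside \u{...}.
const bmp = '\u{41}';       // U+0041, inside the BMP: one code unit
const astral = '\u{1D306}'; // U+1D306, outside the BMP: two code units

console.log(bmp.length);                // 1
console.log(astral.length);             // 2
console.log(astral === '\uD834\uDF06'); // true

// Code point based accessors read and build whole code points.
console.log(astral.codePointAt(0).toString(16));       // "1d306"
console.log(String.fromCodePoint(0x1D306) === astral); // true

// The older code unit accessor only sees the surrogate halves.
console.log(astral.charCodeAt(0).toString(16)); // "d834"
console.log(astral.charCodeAt(1).toString(16)); // "df06"

// The string iterator steps by code point, so spreading counts code points.
console.log([...astral].length); // 1
```

Note that length and charCodeAt keep reporting code units. Only the newer accessors and the iterator work on code points.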
JavaScript’s internal character encoding: UCS-2 or UTF-16?
- Good read.
- UTF-16 is an extension of UCS-2 that allows representing code points outside the BMP. It produces a variable-length result of either one or two 16-bit code units per code point. This way, it can encode code points in the range from 0 to 0x10FFFF while UCS-2 is limited to 0 to 0xFFFF.
- Surrogate pairs: Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
- The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
- The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
- UCS-2 lacks the concept of surrogate pairs, and therefore interprets 0xD834 0xDF06 (the previous UTF-16 encoding) as two separate characters.
- JavaScript treats code units as individual characters, while humans generally think in terms of Unicode characters. This has some unfortunate consequences for Unicode characters outside the BMP. Since surrogate pairs consist of two code units, '𝌆'.length == 2, even though there’s only one Unicode character there. The individual surrogate halves are exposed as if they were characters: '𝌆' == '\uD834\uDF06'.
- This UCS-2-like behavior affects the entire language — for example, regular expressions for ranges of supplementary characters are much harder to write than in languages that do support UTF-16.
- Surrogate pairs are only recombined into a single Unicode character when they’re displayed by the browser (during layout).
- If you want to count the number of Unicode characters in a JavaScript string, or create a string based on a non-BMP Unicode code point, you could use Punycode.js’s utility functions to convert between UCS-2 strings and UTF-16 code points:
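
For reference, the surrogate pair arithmetic described in these notes can also be done by hand, without a library. The following is a minimal sketch. The helper names toSurrogatePair, fromSurrogatePair, and countCodePoints are hypothetical and are not part of Punycode.js or any standard API.

```javascript
// Hypothetical helpers for illustration; not part of Punycode.js.

// Split a supplementary code point (U+10000..U+10FFFF) into
// its high (lead) and low (trail) surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Recombine a high/low surrogate pair into a single code point.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Count code points by treating each well-formed surrogate pair as one.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const next = str.charCodeAt(i + 1); // NaN past the end of the string
    const isPair = unit >= 0xD800 && unit <= 0xDBFF &&
                   next >= 0xDC00 && next <= 0xDFFF;
    if (isPair) i++; // skip the low surrogate
    count++;
  }
  return count;
}

const [high, low] = toSurrogatePair(0x1D306);
console.log(high.toString(16), low.toString(16));            // d834 df06
console.log(fromSurrogatePair(0xD834, 0xDF06).toString(16)); // 1d306
console.log(countCodePoints('a\uD834\uDF06b'));              // 3
```

An unpaired surrogate is counted as one code point in this sketch, which matches how the string would be iterated code unit by code unit.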