unicode
Links
- See also: ECMAScript Internationalization API
- Unicode-aware regular expressions in ECMAScript 6
- Unicode Supplementary Characters for ECMAScript (6+?)
- JavaScript’s internal character encoding: UCS-2 or UTF-16?
- Good read.
- UTF-16 is an extension of UCS-2 that allows
representing code points outside the BMP. It produces
a variable-length result of either one or two 16-bit
code units per code point. This way, it can encode
code points in the range from 0 to 0x10FFFF while
UCS-2 is limited to 0 to 0xFFFF.
- Surrogate pairs:
Characters outside the BMP, e.g. U+1D306 tetragram
for centre (𝌆), can only be encoded in UTF-16 using
two 16-bit code units: 0xD834 0xDF06. This is called
a surrogate pair. Note that a surrogate pair
represents only a single character.
- The first code unit of a surrogate pair is always in
the range from 0xD800 to 0xDBFF, and is called a
high surrogate or a lead surrogate.
- The second code unit of a surrogate pair is always
in the range from 0xDC00 to 0xDFFF, and is called a
low surrogate or a trail surrogate.
- UCS-2 lacks the concept of surrogate pairs, and
therefore interprets 0xD834 0xDF06 (the previous
UTF-16 encoding) as two separate characters.
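The two surrogate ranges above fall out of simple bit arithmetic on the code point. A minimal sketch of that mapping, using the helper names toSurrogatePair and fromSurrogatePair, which are made up for this note and not part of any library:

```javascript
// Hypothetical helpers for illustration; not part of any library.

// Split a supplementary code point (0x10000..0x10FFFF) into two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('not a supplementary code point');
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits    -> 0xD800..0xDBFF
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> 0xDC00..0xDFFF
  return [high, low];
}

// Recombine a high/low surrogate pair into the code point it encodes.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const pair = toSurrogatePair(0x1D306);
console.log(pair.map((unit) => unit.toString(16))); // d834, df06
console.log(fromSurrogatePair(0xD834, 0xDF06).toString(16)); // 1d306
```

Because each half carries 10 bits, the scheme covers exactly the 0x100000 code points from 0x10000 to 0x10FFFF.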
- JavaScript treats code units as individual
characters, while humans generally think in terms
of Unicode characters. This has some unfortunate
consequences for Unicode characters outside the BMP.
Since surrogate pairs consist of two code units,
'𝌆'.length == 2, even though there’s only one
Unicode character there. The individual surrogate
halves are being exposed as if they were characters:
'𝌆' == '\uD834\uDF06'.
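A short sketch of how the surrogate halves surface through the ordinary string API. The variable names are illustrative only:

```javascript
const tetragram = '\uD834\uDF06'; // U+1D306 written as its surrogate pair

console.log(tetragram.length);                     // 2
console.log(tetragram.charCodeAt(0).toString(16)); // d834
console.log(tetragram.charCodeAt(1).toString(16)); // df06

// Indexing by code unit splits the pair and leaves a lone high surrogate.
const loneHigh = tetragram.charAt(0);
console.log(loneHigh === '\uD834');                // true
```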
- This UCS-2-like behavior affects the entire language.
For example, regular expressions for ranges of
supplementary characters are much harder to write
than in languages that do support UTF-16.
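To make the regular expression problem concrete, here is a sketch that matches the range U+1D300 to U+1D356 in two ways. It assumes an engine that supports the ES6 u flag (the topic of the "Unicode-aware regular expressions" link above). The variable names are illustrative:

```javascript
const symbol = '\uD834\uDF06'; // U+1D306

// Without the u flag, "." matches one code unit, so a single
// supplementary character does not satisfy /^.$/.
const dotLegacy = /^.$/.test(symbol);   // false
const dotUnicode = /^.$/u.test(symbol); // true

// The range spelled out by hand in terms of surrogate code units:
const rangeLegacy = /^\uD834[\uDF00-\uDF56]$/;
// The same range with the u flag and code point escapes:
const rangeUnicode = /^[\u{1D300}-\u{1D356}]$/u;

console.log(dotLegacy, dotUnicode);
console.log(rangeLegacy.test(symbol), rangeUnicode.test(symbol));
```

The hand-written form only stays this short because the whole range shares one high surrogate. A range that spans several high surrogates needs one alternative per group.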
- Surrogate pairs are only recombined into a single
Unicode character when they’re displayed by the
browser (during layout).
- If you want to count the number of Unicode
characters in a JavaScript string, or create a string
based on a non-BMP Unicode code point, you could use
Punycode.js’s utility functions to convert between
UCS-2 strings and Unicode code points:
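In Punycode.js these are punycode.ucs2.decode (string to array of code points) and punycode.ucs2.encode (array of code points to string). As a dependency-free sketch of the same two operations, the block below uses ES6 built-ins instead of Punycode.js. The helper names countCodePoints and fromCodePoints are made up for this note:

```javascript
// Hypothetical helpers for illustration; these use ES6 built-ins, not Punycode.js.

// String#length replacement: the string iterator walks code points,
// so a surrogate pair counts once.
function countCodePoints(str) {
  return Array.from(str).length;
}

// String.fromCharCode replacement that accepts non-BMP code points.
function fromCodePoints(codePoints) {
  return String.fromCodePoint(...codePoints);
}

console.log(countCodePoints('\uD834\uDF06'));                // 1
console.log(fromCodePoints([0x1D306]) === '\uD834\uDF06');   // true
console.log('\uD834\uDF06'.codePointAt(0).toString(16));     // 1d306
```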