Links
- My Sites Page
- Note about ISO-8859-1 (aka Latin-1): it is the only encoding whose 256 characters are identical to the first 256 characters of Unicode.
- From: JavaScript’s internal character encoding: UCS-2 or UTF-16?
- Unicode identifies characters by an unambiguous name and an integer number called its code point. For example, the © character is named “copyright sign” and has U+00A9 as its code point (0xA9 can be written as 169 in decimal).
- The Unicode code space is divided into seventeen planes of 2^16 (65,536) code points each. Some of these code points have not yet been assigned character values, some are reserved for private use, and some are permanently reserved as non-characters. The code points in each plane have the hexadecimal values xy0000 to xyFFFF, where xy is a hex value from 00 to 10, signifying which plane the values belong to.
- The first plane (xy is 00) is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters.
- The other sixteen planes are called supplementary planes or astral planes.
- UTF-8: Bits, Bytes, and Benefits
- UTF-16 and UCS-2 (from JavaScript’s internal character encoding: UCS-2 or UTF-16?)
- UTF-16 is an extension of UCS-2 that allows representing code points outside the BMP. It produces a variable-length result of either one or two 16-bit code units per code point. This way, it can encode code points in the range from 0 to 0x10FFFF while UCS-2 is limited to 0 to 0xFFFF.
- Surrogate pairs: Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
- The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
- The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
- UCS-2 lacks the concept of surrogate pairs, and therefore interprets 0xD834 0xDF06 (the previous UTF-16 encoding) as two separate characters.
- Unicode blocks
- Confusables
Pattern_Syntax
- Characters by groups
- Emoji
- FAQ on normalization
- Unicode normalization forms
- Wikipedia – Unicode equivalence
- elazarl/javaUnicodePitfalls
- Security
- Newlines
- Word separator
- Word-separator characters include the space (U+0020), the no-break space (U+00A0), the Ethiopic word space (U+1361), the Aegean word separators (U+10100, U+10101), the Ugaritic word divider (U+1039F), and the Phoenician Word Separator (U+1091F). If there are no word-separator characters, or if a word-separating character has a zero advance width (such as the zero width space U+200B) then the user agent must not create an additional spacing between words. General punctuation and fixed-width spaces (such as U+3000 and U+2000 through U+200A) are not considered word-separator characters.
- Official IANA names for encodings such as utf-8
- The character set names may be up to 40 characters taken from the
printable characters of US-ASCII. However, no distinction is made
between use of upper and lower case letters.
- opentag.com - XML FAQ: Encoding
- http://joelonsoftware.com/articles/Unicode.html
- Supplementary characters
- e.g. the first character of “𠮷野家” (Yoshinoya) is a supplementary character.
- Code points are not the right abstraction for text processing. Grapheme clusters (the substrings that a user would perceive as a character) are likely better.
- Wikipedia
- Ready-made versus composite characters
- Combining character
- Precomposed character
- These two lines should render the same (the Swedish name Åström)
- Precomposed:
Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
- Combined:
Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)
- So should these two:
- Precomposed:
ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
- Combined:
ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)
- Unicode equivalence
- Variation selectors
- Glyph vs. Character
- More
- Characters
- Non-breaking space
- wikipedia: Non-breaking space
- Keyboard entry:
- Mac:
Option-Space
- Vim:
Ctrl-K Space Space
- Emacs:
Ctrl+X 8 Space
- Special characters
- Specials (wikipedia)
- U+FFF9, interlinear annotation anchor, marks start of annotated text
- U+FFFA, interlinear annotation separator, marks start of annotating text
- U+FFFB, interlinear annotation terminator, marks end of annotating text
- U+FFFC, object replacement character, placeholder in the text for another unspecified object, for example in a compound document.
- This one appears to behave like a "negative" character in gvim on Mac: if you have a vertically split window, the vertical split character on that line shifts left by 1 character.
- U+FFFD � replacement character used to replace an unknown or unprintable character
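Several of the notes in the list above (Latin-1 versus the first 256 code points, planes, surrogate pairs, precomposed versus combined characters) can be checked from Python's standard library. The following is a minimal sketch. The helper names `plane_of` and `utf16_code_units` are made up for illustration and are not taken from any of the linked pages.

```python
import unicodedata

# Latin-1 maps byte value N to code point U+00NN for all 256 byte values.
all_bytes = bytes(range(256))
latin1_matches_unicode = all_bytes.decode("latin-1") == "".join(map(chr, range(256)))


def plane_of(code_point):
    """Return the plane number: 0 is the BMP, 1 to 16 are supplementary planes."""
    return code_point >> 16


def utf16_code_units(ch):
    """Return the list of 16-bit code units that UTF-16 uses for one character."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# U+1D306 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
tetragram = "\U0001D306"
units = utf16_code_units(tetragram)  # [0xD834, 0xDF06]

# The two spellings of the Swedish name, written with escapes so that
# an editor cannot silently normalize them.
precomposed = "\u00c5str\u00f6m"
combined = "A\u030astro\u0308m"
same_after_nfc = unicodedata.normalize("NFC", combined) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == combined
```

The two spellings compare unequal as raw strings and only match after normalizing both to the same form, which is why comparisons of user-visible text usually normalize first.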
HowTo
Specify the encoding of files
Python source files
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
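To see which encoding Python would infer from such a header, the standard library's `tokenize.detect_encoding` can be pointed at the raw bytes of a file. This is only a small sketch; the sample source bytes are invented for the example.

```python
import io
import tokenize

# Raw bytes of a hypothetical source file that carries a coding declaration.
source = b"#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\nprint('hi')\n"

# detect_encoding reads at most two lines and returns (encoding, lines_read).
encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)

# With no declaration and no BOM, the default encoding is reported.
undeclared = b"print('hi')\n"
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(undeclared).readline)
```

The declaration is only recognized on the first or second line of the file, which is why it sits directly under the shebang.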
XML source files
The default encoding if nothing else is specified is utf-8.
If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).
The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.
<?xml version="1.0" encoding="iso-8859-1"?>
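As a rough illustration of the BOM-based fallback described above, the sketch below checks the first bytes of a document against the BOM constants in Python's `codecs` module. The function name `sniff_bom` is hypothetical, and the sketch is deliberately simplified: it only looks at UTF-8 and UTF-16 BOMs and does not parse the encoding declaration itself, which a real XML parser would also do.

```python
import codecs

# (BOM bytes, encoding name) pairs; only UTF-8 and UTF-16 are handled here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data, default="utf-8"):
    """Guess an encoding from a leading BOM, falling back to `default`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


utf16_doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
plain_doc = b"<a/>"
```

Here `sniff_bom(utf16_doc)` yields `"utf-16-le"`, while `sniff_bom(plain_doc)` falls back to `"utf-8"`.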
HTML source files
When HTML documents are served there are three ways to tell the browser what specific character encoding is to be used for display to the reader. First, HTTP headers can be sent by the web server along with each web page (HTML document). A typical HTTP header looks like this:
Content-Type: text/html; charset=ISO-8859-1
For HTML (not usually XHTML), the other method is for the HTML document to include this information at its top, inside the HEAD element.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
XHTML documents have a third option: to express the character encoding in the XML preamble, for example
<?xml version="1.0" encoding="ISO-8859-1"?>
The HTTP header specification supersedes all HTML (or XHTML) meta tag specifications, which can be a problem if the header is incorrect and one does not have the access or the knowledge to change it.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<html lang="zh-CN">
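The precedence rule above (HTTP header first, meta tag second) can be sketched in Python by reusing the standard library's `email.message.Message` to parse the `charset` parameter. The function names below are hypothetical, and the sketch assumes the meta tag's `content` value has already been extracted from the HTML.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def effective_charset(http_content_type, meta_content):
    """Prefer the HTTP header's charset; fall back to the meta tag's."""
    for value in (http_content_type, meta_content):
        if value:
            charset = charset_from_content_type(value)
            if charset:
                return charset
    return None


# The header wins even when the document itself says utf-8.
winner = effective_charset("text/html; charset=ISO-8859-1",
                           "text/html; charset=utf-8")
```

This is exactly the situation the note warns about: a wrong header silently overrides a correct meta tag.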
You can also leave the encoding undeclared, keep the file 7-bit, and use &#NNNN; HTML/XML numeric character references for the Unicode code points. Python's encoding functions accept errors="xmlcharrefreplace" to do that instead of sending UTF-8 across. However, this is not so great: editors can't show you a readable representation of such a file, it takes too much space, and UTF-8 has been well supported since IE 5.0.
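A short sketch of that approach in Python, using the built-in `xmlcharrefreplace` error handler. The sample string is made up for the example, and `html.unescape` is used only to show that the references decode back to the original text.

```python
import html

# Non-ASCII sample text, written with escapes to keep this snippet 7-bit too.
text = "\u00c5str\u00f6m \u00a9"

# Characters that ASCII cannot represent become decimal character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
# ascii_bytes == b'&#197;str&#246;m &#169;'

# A browser (or html.unescape) turns the references back into the characters.
round_tripped = html.unescape(ascii_bytes.decode("ascii"))
```

The encoded form is 23 bytes for 8 characters, which illustrates the space overhead mentioned above.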
CSS source files
The encoding of a CSS file is determined according to the following rules:
- If the file uses HTTP: by the HTTP charset parameter in the Content-Type field.
- By the value for the @charset command at the top of the CSS file.
- By the declaration mechanism of the referencing document, if one exists. For example in XHTML: the charset attribute of the <link> element.
For example, to specify iso-8859-1 (Latin-1) encoding:
@charset "iso-8859-1";
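The rules above could be approximated as follows, assuming they are applied in the order listed. This is a simplified sketch and not how any particular browser implements it: the function name `css_encoding` and its parameters are hypothetical, and it ignores BOMs and any final default.

```python
import re

# An @charset rule is only honoured at the very start of the stylesheet.
_CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')


def css_encoding(http_charset, css_bytes, link_charset):
    """Pick an encoding: HTTP charset, then @charset, then the linking document."""
    if http_charset:
        return http_charset.lower()
    match = _CHARSET_RULE.match(css_bytes)
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    if link_charset:
        return link_charset.lower()
    return None  # nothing declared; the caller must choose a default


sheet = b'@charset "iso-8859-1";\nbody { margin: 0 }'
```

For example, `css_encoding(None, sheet, "utf-8")` returns `"iso-8859-1"`, because the `@charset` rule outranks the referencing document's hint.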
Lua source files
See http://lua-users.org/wiki/LuaUnicode
Basically, you can have Unicode text in Lua source code. However, you can't use Unicode for any Lua identifiers (since Lua uses isalpha, etc. for identifying those). There are also no Unicode-aware string operations provided. Lua is mostly just blind to it.