unicode

Links

My Sites Page
Note about ISO-8859-1 aka Latin-1.
- It is the only encoding whose 256 characters are identical to the first 256 code points of Unicode (U+0000 to U+00FF).
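A quick way to check the Latin-1 claim, as a minimal sketch using Python's built-in "latin-1" codec (the variable names are only for illustration):

```python
# Decode every possible byte value with the Latin-1 codec.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value b should map to the code point with the same number.
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# Encoding the result again should give back the original bytes.
round_trip = decoded.encode("latin-1") == all_bytes
```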
From: JavaScript’s internal character encoding: UCS-2 or UTF-16?
- Unicode identifies characters by an unambiguous name and an integer number called its code point. For example, the © character is named “copyright sign” and has U+00A9—0xA9 can be written as 169 in decimal—as its code point.
- The Unicode code space is divided into seventeen planes of 2^16 (65,536) code points each. Some of these code points have not yet been assigned character values, some are reserved for private use, and some are permanently reserved as non-characters. The code points in each plane have the hexadecimal values xy0000 to xyFFFF, where xy is a hex value from 00 to 10, signifying which plane the values belong to.
- The first plane (xy is 00) is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters.
- The other sixteen planes are called supplementary planes or astral planes.
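The code point and plane arithmetic above can be sketched in Python. The helper name plane_of is hypothetical; ord and unicodedata.name are standard library calls.

```python
import unicodedata

def plane_of(ch):
    """Return the plane number (0 to 16) of a single character."""
    # Each plane holds 2**16 code points, so the plane is the part above the low 16 bits.
    return ord(ch) >> 16

copyright_sign = "\u00a9"
code_point = ord(copyright_sign)            # 0xA9, i.e. 169 in decimal
name = unicodedata.name(copyright_sign)     # the character's Unicode name

bmp_plane = plane_of(copyright_sign)        # plane 0, the BMP
astral_plane = plane_of("\U0001D306")       # a supplementary plane
```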
UTF-8: Bits, Bytes, and Benefits
UTF-16 and UCS-2 (from JavaScript’s internal character encoding: UCS-2 or UTF-16?)
- UTF-16 is an extension of UCS-2 that allows representing code points outside the BMP. It produces a variable-length result of either one or two 16-bit code units per code point. This way, it can encode code points in the range from 0 to 0x10FFFF while UCS-2 is limited to 0 to 0xFFFF.
- Surrogate pairs: Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.
- The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.
- The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
- UCS-2 lacks the concept of surrogate pairs, and therefore interprets 0xD834 0xDF06 (the previous UTF-16 encoding) as two separate characters.
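The surrogate pair for U+1D306 can be reproduced in Python, both by asking the UTF-16 codec and by doing the arithmetic by hand. This is a sketch; the function names are made up for illustration.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")  # big-endian, no BOM
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

units = utf16_code_units("\U0001D306")
pair = to_surrogate_pair(0x1D306)
```

A UCS-2 consumer would see the two entries of units as two separate characters, which is why string lengths can differ between UCS-2 and UTF-16 aware code.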
Unicode blocks
Confusables
Pattern_Syntax
Characters by groups
Emoji
- Emoji Symbols: Background Data
- Emoji in the Unicode Standard (Wikipedia)
FAQ on normalization
- Unicode normalization forms
- Wikipedia – Unicode equivalence
elazarl/javaUnicodePitfalls
Security
- Network.IDN.blacklist chars - MozillaZine
- IDN in Google Chrome
- UTR #36: Unicode Security Considerations
- See also: Creative usernames and Spotify account hijacking
  - ᴮᴵᴳᴮᴵᴿᴰ
- Unicode Security Considerations
- Unicode Security Mechanisms.
Newlines
- Newline Characters
Word separator
- Word-separator characters include the space (U+0020), the no-break space (U+00A0), the Ethiopic word space (U+1361), the Aegean word separators (U+10100,U+10101), the Ugaritic word divider (U+1039F), and the Phoenician Word Separator (U+1091F). If there are no word-separator characters, or if a word-separating character has a zero advance width (such as the zero width space U+200B) then the user agent must not create an additional spacing between words. General punctuation and fixed-width spaces (such as U+3000 and U+2000 through U+200A) are not considered word-separator characters.
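As a small illustration, the word-separator list quoted above could be held in a set of code points. The names WORD_SEPARATORS and is_word_separator are hypothetical, and the set contains only the characters named in the quote.

```python
# Code points listed as word separators in the quoted text.
WORD_SEPARATORS = frozenset({
    0x0020,   # space
    0x00A0,   # no-break space
    0x1361,   # Ethiopic word space
    0x10100,  # Aegean word separator line
    0x10101,  # Aegean word separator dot
    0x1039F,  # Ugaritic word divider
    0x1091F,  # Phoenician word separator
})

def is_word_separator(ch):
    """True if the single character ch is in the quoted list."""
    return ord(ch) in WORD_SEPARATORS
```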
Official IANA names for encodings such as utf-8
- The character set names may be up to 40 characters taken from the
  printable characters of US-ASCII. However, no distinction is made
  between use of upper and lower case letters.
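Python's codec registry follows the same case-insensitive convention, which makes for an easy check. Note that the canonical names Python reports are its own and not necessarily the IANA spelling; same_codec is a hypothetical helper.

```python
import codecs

def same_codec(name_a, name_b):
    """True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name

upper_vs_lower = same_codec("UTF-8", "utf-8")
mixed_case = same_codec("ISO-8859-1", "iso-8859-1")
```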
opentag.com - XML FAQ: Encoding
http://joelonsoftware.com/articles/Unicode.html
Supplementary characters
- e.g. the first character of “𠮷野家” (Yoshinoya) is a supplementary character.
- Code points are not the right abstraction for text processing. Grapheme clusters (the substrings that a user would perceive as a character) are likely better.
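A sketch of how the counts differ for the Yoshinoya example. Python 3 strings are sequences of code points, so len counts code points, and the UTF-16 figure is derived from the encoded byte length. The standard library has no grapheme cluster segmentation, so the last lines only show that one user-perceived character can span two code points.

```python
name = "𠮷野家"

code_points = len(name)                            # Python counts code points
utf16_units = len(name.encode("utf-16-le")) // 2   # 16-bit code units
first_is_supplementary = ord(name[0]) > 0xFFFF     # outside the BMP

# "A" followed by COMBINING RING ABOVE: one perceived character, two code points.
a_ring = "A\u030a"
a_ring_code_points = len(a_ring)
```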
Wikipedia
- Ready-made versus composite characters
- Combining character
- Precomposed character
  - These two lines should render the same (the Swedish name Åström)
    - Precomposed: Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
    - Combined: Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)
  - So should these two:
    - Precomposed: ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
    - Combined: ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)
- Unicode equivalence
- Variation selectors
  - Glyph vs. Character
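The precomposed and combined spellings listed above compare unequal as raw strings but become equal after normalization. A minimal sketch with Python's unicodedata module, using the code points from the two examples:

```python
import unicodedata

# Åström: precomposed vs. base letters plus combining marks
precomposed_name = "\u00c5str\u00f6m"
combined_name = "A\u030astro\u0308m"

# ḱṷṓn: precomposed vs. combined
precomposed_word = "\u1e31\u1e77\u1e53n"
combined_word = "k\u0301u\u032do\u0304\u0301n"

raw_equal = precomposed_name == combined_name  # differs before normalization

# NFC composes, NFD decomposes; either form makes the pairs comparable.
nfc_name = unicodedata.normalize("NFC", combined_name)
nfd_name = unicodedata.normalize("NFD", precomposed_name)
nfc_word = unicodedata.normalize("NFC", combined_word)
nfd_word = unicodedata.normalize("NFD", precomposed_word)
```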
More
- As Yet Unsupported Scripts
- Character Code Charts
  - ASCII chart (pdf)
  - Where is my Character?
- Unicode Standard: Latest Version
- Love Hotels and Unicode
Characters
- Non-breaking space
  - wikipedia: Non-breaking space
  - Keyboard entry:
    - Mac: Option-Space
    - Vim: Ctrl-K Space Space
    - Emacs: Ctrl+X 8 Space
- Special characters
  - Specials (wikipedia)
    - U+FFF9, interlinear annotation anchor, marks start of annotated text
    - U+FFFA, interlinear annotation separator, marks start of annotating text
    - U+FFFB, interlinear annotation terminator, marks end of annotating text
    - U+FFFC, object replacement character, placeholder in the text for another unspecified object, for example in a compound document.
      - This one appears to behave like a "negative-width" character in gvim on the Mac: if the window has a vertical split, the split character on that line shifts left by one character.
    - U+FFFD � replacement character used to replace an unknown or unprintable character
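One common place to meet U+FFFD is when decoding bytes that are not valid in the assumed encoding. A sketch in Python, where the input bytes are a made-up example of Latin-1 data misread as UTF-8:

```python
# "café" encoded as Latin-1; the final byte 0xE9 is not valid UTF-8 on its own.
latin1_bytes = b"caf\xe9"

# errors="replace" substitutes U+FFFD for the undecodable byte instead of raising.
text = latin1_bytes.decode("utf-8", errors="replace")
has_replacement = "\ufffd" in text
```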

HowTo

Specify the encoding of files

Python source files

#!/usr/bin/env python  
# -*- coding: UTF-8 -*-

XML source files

If nothing else is specified, the default encoding is UTF-8.

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.

<?xml version="1.0" encoding="iso-8859-1"?>
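BOM detection can be sketched in a few lines of Python using the constants in the codecs module. This is only an illustration: sniff_bom is a hypothetical name, the UTF-8 default mirrors the note above, and a real XML parser also inspects the bytes of the XML declaration itself when no BOM is present.

```python
import codecs

# UTF-32 must be checked before UTF-16, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data, default="utf-8"):
    """Guess the encoding of data from its BOM, else return default."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return default
```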

HTML source files

When HTML documents are served there are three ways to tell the browser what specific character encoding is to be used for display to the reader. First, HTTP headers can be sent by the web server along with each web page (HTML document). A typical HTTP header looks like this:

Content-Type: text/html; charset=ISO-8859-1

For HTML (not usually XHTML), the second method is for the HTML document to include this information at its top, inside the HEAD element.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

XHTML documents have a third option: to express the character encoding in the XML preamble, for example

<?xml version="1.0" encoding="ISO-8859-1"?>

The HTTP header specification supersedes all HTML (or XHTML) meta tag specifications, which can be a problem if the header is incorrect and one does not have the access or the knowledge to change it.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<html lang="zh-CN">
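The precedence of the HTTP header over the meta tag can be sketched as follows. The function names and the default argument are hypothetical; the header parsing relies on email.message.Message from the standard library, which understands Content-Type parameters.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, or None if absent

def effective_charset(http_content_type=None, meta_charset=None, default="utf-8"):
    """Pick the charset a browser would use: HTTP header first, then meta tag."""
    if http_content_type:
        from_header = charset_from_content_type(http_content_type)
        if from_header:
            return from_header
    # No usable header: fall back to the meta tag, then to an assumed default.
    return (meta_charset or default).lower()
```

With a header of text/html; charset=ISO-8859-1 and a meta tag declaring UTF-8, the sketch returns iso-8859-1, which is exactly the situation the paragraph above warns about.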

You can also leave the encoding undeclared, keep the file 7-bit ASCII, and use &#NNNN; numeric character references for the Unicode code points. Python's encoding functions accept errors="xmlcharrefreplace" to produce these instead of emitting UTF-8. However, this is rarely worthwhile: editors can't show you a readable representation of such a file, it takes more space, and UTF-8 has been well supported since IE 5.0.

CSS source files

The encoding of a CSS file is determined according to the following rules, in order of priority:

- If the file is served over HTTP: by the charset parameter in the Content-Type field.
- By the value of the @charset rule at the top of the CSS file.
- By the declaration mechanism of the referencing document, if one exists. For example in XHTML: the charset attribute of the <link> element.

For example, to specify iso-8859-1 (Latin-1) encoding:

@charset "iso-8859-1";
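The lookup order for CSS can be sketched as a small function. All names here are hypothetical, and the regular expression is a simplification that expects the @charset rule as the very first bytes of the file, written with double quotes and terminated by a semicolon.

```python
import re

# Simplified pattern: @charset "name"; at the very start of the file.
_CHARSET_RULE = re.compile(rb'^@charset "([^"]+)";')

def css_encoding(css_bytes, http_charset=None, referrer_charset=None):
    """Resolve a stylesheet's encoding using the priority order above."""
    if http_charset:                      # 1. HTTP Content-Type charset
        return http_charset.lower()
    match = _CHARSET_RULE.match(css_bytes)
    if match:                             # 2. @charset rule in the file
        return match.group(1).decode("ascii", errors="replace").lower()
    if referrer_charset:                  # 3. referencing document, e.g. <link charset>
        return referrer_charset.lower()
    return None                           # undetermined in this sketch
```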

Lua source files

See http://lua-users.org/wiki/LuaUnicode

Basically, you can have Unicode text (e.g. UTF-8) in Lua source code. However, you can't use Unicode in Lua identifiers, since Lua uses isalpha etc. to identify those. There are also no Unicode-aware string operations provided. Lua is mostly just blind to it.