i18n

Links

W3C I18n Site Index
- An Introduction to Multilingual Web Addresses
Language Codes: ISO 639 (wikipedia)
- Composed of 6 parts
  - ISO 639-1: Alpha-2 code (official)
  - ISO 639-2: Alpha-3 code (official)
  - ISO 639-3: Alpha-3 code for comprehensive coverage (official)
  - ISO 639-4: Implementation guidelines and general principles for language coding
  - ISO 639-5: Alpha-3 code for language families and groups (official)
  - ISO 639-6: Alpha-4 representation for comprehensive coverage of language variants (official)
Capitalization/Casing Recommendations
- ISO639-1 recommends that language codes be written in lowercase. ('mn' Mongolian).
- ISO15924 recommends that script codes use lowercase with the initial letter capitalized. ('Cyrl' Cyrillic).
- ISO3166-1 recommends that region/country codes be capitalized. ('MN' Mongolia).
- All other subtags: prefer lowercase.
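The casing recommendations above can be sketched as a small helper. This is a rough sketch, not an official algorithm: `format_tag_case` is a made-up name, and it follows the usual convention that anything after a single-character subtag (such as 'x') stays lowercase.

```python
def format_tag_case(tag: str) -> str:
    """Apply the conventional casing to each subtag of a language tag."""
    out = []
    after_singleton = False
    for position, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        if len(lowered) == 1:
            # Everything from a singleton onward stays lowercase.
            after_singleton = True
        if position == 0 or after_singleton:
            out.append(lowered)               # language code: lowercase
        elif len(lowered) == 2:
            out.append(lowered.upper())       # region code: uppercase
        elif len(lowered) == 4:
            out.append(lowered.capitalize())  # script code: initial capital
        else:
            out.append(lowered)               # all other subtags: lowercase
    return "-".join(out)


print(format_tag_case("MN-cYRL-mn"))  # mn-Cyrl-MN
```

Casing carries no meaning in a tag; this only normalizes how it is displayed.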
Language, Script, Region
- Charts Listing
- Scripts, Languages, and Territories
- www.iana.org/assignments/language-subtag-registry
- Likely Subtags → (language, script, region)
- Language-Territory Information
- Territory-Language Information
Country Codes: ISO 3166-1 (wikipedia)
Country or Region codes: UN M.49
- Countries or areas, codes and abbreviations
- Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings
  - e.g. 001 → World, 019 → Americas, 142 → Asia, 150 → Europe.
  - e.g. 199 → least developed countries, 432 → landlocked developing countries, 722 → small island developing countries, etc.
- Country or area numerical codes added or changed since 1982
Script codes (for writing): ISO 15924: Codes for the representation of names of scripts (wikipedia)
- Each script is given both a four-letter code and a numeric one.
- Script is defined as "set of graphic characters used for the written form of one or more languages".
- One could differentiate, for example, between Serbian written in the Cyrillic (sr-Cyrl) or Latin (sr-Latn) script, or mark romanized text as such.
- ISO 15924: Code Lists
  - List of four-letter script codes
  - List of numeric script codes
- Additions and Changes to ISO 15924 Codes
BCP47 (BCP = Best Current Practice) introduces
- lang (required): ex: en, fr
- script (optional): ex: Hans
- region (optional): ex: 419 for Latin America (es-419 indicates Latin American Spanish)
- Understanding the New Language Tags
- This BCP is composed of two RFCs
  - RFC 5646: Tags for Identifying Languages
  - RFC 4647: Matching Language Tags
unicode.org: A General Method for Rendering Combining Marks
unicode.org: Unicode Line Breaking Algorithm
ICU Home
- ICU User Guide
- PyICU
Language Plural Rules
Slides: IUC 36: Plural & Gender & More in Translated Messages
Library of Congress >> Standards >> ISO 639-2 >> Codes for the representation of names of languages
CLDR - Unicode Common Locale Data Repository
- CLDR Charts
- Survey Tool
Google Developers: Internationalization
Why flags should not be used to indicate language choice
Closure JS Library
- Google Developers: Translation
- wiki: Translations
OS X
- Internationalization and Localization Guide
  - Localizing Your App
    - Exporting Localizations
Interesting Languages for test cases
- Catalan, Romanian, Russian
- Catalan is interesting for gender
- Russian is interesting for pluralization
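To show why Russian makes a good pluralization test case, here is a minimal sketch of its plural categories for whole numbers only (fractions fall into a separate category that is ignored here). The function name and the sample noun forms are my own illustration, not taken from any library.

```python
def russian_plural_category(n: int) -> str:
    """Return the plural category of a whole number in Russian."""
    n = abs(n)
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"   # 1, 21, 31, ... but not 11
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"   # 2-4, 22-24, ... but not 12-14
    return "many"      # 0, 5-20, 25-30, ...


# Illustrative forms of the noun "file".
FORMS = {"one": "файл", "few": "файла", "many": "файлов"}

for count in (1, 2, 5, 11, 21, 22):
    print(count, FORMS[russian_plural_category(count)])
```

English needs only two forms, so a message that works with "1 file / 2 files" can still break for 21 or 22 in Russian.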
Misc
- MessageFormat (Java Platform SE 7)

RFC 5646

Ref: RFC 5646: Tags for Identifying Languages (obsoletes rfc4646)

Update: Looks like this is all nicely explained at Language tags in HTML and XML making most of the stuff below unnecessary. Why, oh why, is it so easy to find all those old docs and references but not something like this – the first useful and real reference to RFC 5646! Sigh.

2.2. Language Subtag Sources and Interpretation

"Tag" refers to a complete language tag, such as "sr-Latn-RS" or "az-Arab-IR".
"Subtag" refers to a specific section of a tag, delimited by a hyphen, such as the subtags 'zh', 'Hant', and 'CN' in the tag "zh-Hant-CN".
"Code" refers to values defined in external standards (and that are used as subtags in this document).
- For example, 'Hant' is an ISO15924 script code that was used to define the 'Hant' script subtag for use in a language tag. Examples of codes in this document are enclosed in single quotes ('en', 'Hant').

Language tags are designed so that each subtag type has unique length and content restrictions. These make identification of the subtag's type possible, even if the content of the subtag itself is unrecognized. This allows tags to be parsed and processed without reference to the latest version of the underlying standards or the IANA registry and makes the associated exception handling when parsing tags simpler.
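A rough sketch of what "identifying the type from length and content" can look like, without consulting any registry. `classify_subtag` is a hypothetical helper; it looks at one subtag in isolation and only needs to know whether it is the first one, so it does not track context such as "everything after 'x' is private use".

```python
def classify_subtag(subtag: str, is_first: bool = False) -> str:
    """Guess a subtag's type from its shape alone (no registry lookup)."""
    s = subtag
    if not (s.isascii() and s.isalnum()):
        return "invalid"
    if len(s) == 1:
        return "singleton"        # introduces extension or private use
    if is_first:
        # The first subtag is the language: 2 or 3 letters in these notes.
        return "language" if s.isalpha() and len(s) <= 3 else "unknown"
    if len(s) == 2 and s.isalpha():
        return "region"           # 2 letters
    if len(s) == 3:
        if s.isdigit():
            return "region"       # 3 digits
        if s.isalpha():
            return "extlang"      # 3 letters
        return "invalid"
    if len(s) == 4:
        if s.isalpha():
            return "script"       # 4 letters
        if s[0].isdigit():
            return "variant"      # 4 characters starting with a digit
        return "invalid"
    if 5 <= len(s) <= 8:
        return "variant"          # 5 to 8 characters
    return "invalid"


parts = "zh-Hant-CN".split("-")
print([classify_subtag(p, i == 0) for i, p in enumerate(parts)])
```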

The general format is shown below (refer to the RFC for the ABNF grammar, but note that it's not exactly helpful: for example, the grammar for the extended language subtag allows 1-3 alpha-3 codes separated by hyphens, and the text later says that only 1 is legal; the other forms are permanently reserved and specifying them will always be invalid):

<PrimaryLanguage>[-<Script>][-<Region>]*(-<Variant>)[-<PrivateUse>]

<PrimaryLanguage> is always the first subtag and is required.
- It can contain a hyphen. It's composed of <MacroLanguage>-<ExtendedLanguage> where at least one of them must be specified.
- <MacroLanguage> is either an alpha-2 code, or where an alpha-2 code doesn't exist, an alpha-3 code. Its value comes from the official IANA assignments list (look only for Type: language records).
- <ExtendedLanguage> can only be an alpha-3 code. Again, look at the official IANA assignments list and filter for records with Type: extlang.
  - Note that the IANA records for an extended language are always required to correspond to exactly one MacroLanguage. This is implicit when parsed.
- Parsing: How do you know when you're done parsing the PrimaryLanguage subtag (which can be one of <MacroLanguage>, <ExtendedLanguage>, or <MacroLanguage>-<ExtendedLanguage>)?
  - MacroLanguage is 2 or 3 characters and ExtendedLanguage is exactly 3; both are always ASCII letters, never digits.
  - After the first piece, if the next hyphenated piece isn't exactly 3 ASCII letters (letters, not digits), it's not part of the language subtag.
  - This is because the following piece can only be a Script, Region, Variant, or PrivateUse subtag. Script is always 4 characters when present. Region is either 2 letters or 3 digits. Variant subtags that begin with a letter are at least 5 characters long, and those that begin with a digit are at least 4 characters long.
  - Also, PrivateUse and Extension subtags are always preceded by a single-character subtag, so seeing such a subtag also means the end of parsing something "useful" (not just when parsing the Language subtag). You don't need this additional rule here, since it's covered by the first two rules.
Script is optional. When present, it's always a 4-letter code defined in ISO 15924: List of four-letter script codes. Filter the official IANA assignments list for records with Type: script.
Region is optional. It identifies a country/region/geographic area. When present, it's always a 2-letter code or a 3-digit code. Filter the official IANA assignments list for records with Type: region. These are typically ISO 3166-1 alpha-2 codes and numeric codes from UN M.49.
Variant subtags are used to indicate additional, well-recognized variations that define a language or its dialects that are not covered by other available subtags. You can have 0 or more of these.
- Those that begin with a letter are at least 5 in length and those beginning with a digit are at least 4 in length. Further, they must be distinct (ignoring case.)
- Filter the official IANA assignments list for records with Type: variant.
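The shape rules above are enough to split a tag into its parts with one regular expression. A minimal sketch, assuming the simplified format from these notes: at most one extended language, no extension sequences, no grandfathered tags, and no lookup against the IANA list. The name `parse_language_tag` and the returned keys are my own.

```python
import re
from typing import Optional

_TAG_RE = re.compile(
    r"""
    (?P<language>[a-z]{2,3})                  # MacroLanguage: 2 or 3 letters
    (?:-(?P<extlang>[a-z]{3}))?               # ExtendedLanguage: 3 letters
    (?:-(?P<script>[a-z]{4}))?                # Script: 4 letters
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?       # Region: 2 letters or 3 digits
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # 0+ Variants
    (?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?   # PrivateUse after 'x'
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_language_tag(tag: str) -> Optional[dict]:
    """Split a tag into its subtags, or return None if the shape is wrong."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        return None
    variants = [v for v in match.group("variants").split("-") if v]
    # Variants must be distinct, ignoring case.
    if len({v.lower() for v in variants}) != len(variants):
        return None
    private = match.group("private")
    return {
        "language": match.group("language"),
        "extlang": match.group("extlang"),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": variants,
        "private_use": private.lstrip("-").split("-") if private else [],
    }


print(parse_language_tag("sr-Latn-RS"))
print(parse_language_tag("de-CH-1901"))
```

A successful parse only means the tag is well-formed; whether each subtag is actually registered still requires the IANA list.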

Macro Languages

Ref: ISO 639-3 and Macro Languages (Note that the page is a bit outdated and describes changes that have since taken place. Also, the macro language prefix is not required: forms such as zh-cmn and zh-cmn-Hant are marked redundant, and the preferred tags are just cmn and cmn-Hant respectively.) (Aside: The redundant tags, along with their preferred values where one exists, can be obtained by filtering records in the official IANA assignments list for Type: redundant. Many sign language tags are among them, for example.)

ISO 639 has occasionally assigned codes to "macro-languages", which are language families that contain a number of recognizably related (but not necessarily mutually intelligible) languages. A good example of this is Chinese.

The ISO 639-1 code 'zh' identifies "Chinese", but the concept of Chinese encloses a number of distinct languages or dialects that share certain traits. While these languages are written very similarly, spoken content is very different indeed. The available regional options are poor proxies for the spoken dialects (many of which are confined to mainland China).

Mandarin Chinese (a spoken variation) is identified by the ISO 639-3 code cmn.
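The "filter for Type: redundant" step can be sketched as follows. The sample text is a hand-written, abbreviated excerpt in the registry's layout (records separated by `%%` lines, one `Field: value` per line), not the real file. The function names are hypothetical, and the parser skips folded continuation lines and keeps only the last value of a repeated field.

```python
SAMPLE_REGISTRY = """\
Type: language
Subtag: cmn
Description: Mandarin Chinese
%%
Type: redundant
Tag: zh-cmn
Preferred-Value: cmn
%%
Type: redundant
Tag: zh-cmn-Hant
Preferred-Value: cmn-Hant
"""


def parse_records(text: str) -> list:
    """Split registry text into a list of {field: value} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.strip() == "%%":          # record separator
            if current:
                records.append(current)
            current = {}
        elif ":" in line and not line[0].isspace():
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def preferred_tags(records: list) -> dict:
    """Map each redundant tag (lowercased) to its preferred replacement."""
    return {
        r["Tag"].lower(): r["Preferred-Value"]
        for r in records
        if r.get("Type") == "redundant" and "Preferred-Value" in r
    }


mapping = preferred_tags(parse_records(SAMPLE_REGISTRY))
print(mapping.get("zh-cmn-hant", "zh-cmn-Hant"))  # cmn-Hant
```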

More from RFC 5646

Some of the subtags in the IANA registry do not come from an underlying standard. These can only appear in specific positions in a tag—they can only occur as primary language subtags or as variant subtags.
Sequences of private use and extension subtags MUST occur at the end of the sequence of subtags and MUST NOT be interspersed with subtags defined elsewhere in this document. These sequences are introduced by single-character subtags, which are reserved as follows:
- The single-letter subtag 'x' introduces a sequence of private use subtags.
- The single-letter subtag 'i' is used by some grandfathered tags, such as "i-default", where it always appears in the first position and cannot be confused with an extension.
- All other single-letter and single-digit subtags are reserved to introduce standardized extension subtag sequences as described in Section 3.7: Extensions and the Extensions Registry.
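Since these sequences always sit at the end and are introduced by a single-character subtag, peeling them off is straightforward. A minimal sketch with a made-up function name; the 'u' singleton in the usage line is the Unicode locale extension, used here purely as an example.

```python
def split_singleton_sequences(tag: str):
    """Split a tag into (core, [(singleton, [subtags...]), ...])."""
    subtags = tag.split("-")
    if subtags[0].lower() == "i":
        # Grandfathered form such as "i-default": not an extension.
        return tag, []
    core, sequences = [], []
    for index, subtag in enumerate(subtags):
        lowered = subtag.lower()
        if lowered == "x":
            # Private use runs to the end of the tag, singletons included.
            sequences.append(("x", [s.lower() for s in subtags[index + 1:]]))
            break
        if len(lowered) == 1:
            sequences.append((lowered, []))    # start of an extension
        elif sequences:
            sequences[-1][1].append(lowered)   # belongs to the open sequence
        else:
            core.append(subtag)                # language/script/region/variant
    return "-".join(core), sequences


print(split_singleton_sequences("en-US-u-ca-gregory-x-a-b"))
# ('en-US', [('u', ['ca', 'gregory']), ('x', ['a', 'b'])])
```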
Extended Language Subtags
- Extended language subtags consist solely of three-letter subtags.
- All extended language subtag records defined in the registry were defined according to the assignments found in ISO639-3.
- Language collections and groupings, such as defined in ISO639-5, are specifically excluded from being extended language subtags.