Links
- W3C I18n Site Index
- Language Codes: ISO 639 (wikipedia)
- Composed of 6 parts
- ISO 639-1: Alpha-2 code (official)
- ISO 639-2: Alpha-3 code (official)
- ISO 639-3: Alpha-3 code for comprehensive coverage (official)
- ISO 639-4: Implementation guidelines and general principles for language coding
- ISO 639-5: Alpha-3 code for language families and groups (official)
- ISO 639-6: Alpha-4 representation for comprehensive coverage of language variants (official)
- Capitalization/Casing Recommendations
- ISO639-1 recommends that language codes be written in lowercase. ('mn' Mongolian).
- ISO15924 recommends that script codes use lowercase with the initial letter capitalized. ('Cyrl' Cyrillic).
- ISO3166-1 recommends that region/country codes be capitalized. ('MN' Mongolia).
- All other subtags: prefer lowercase.
- Language, Script, Region
- Country Codes: ISO 3166-1 (wikipedia)
- Country or Region codes: UN M.49
- Countries or areas, codes and abbreviations
- Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings
- e.g. 001 → World, 019 → Americas, 142 → Asia, 150 → Europe.
- e.g. 199 → least developed countries, 432 → landlocked developing countries, 722 → small island developing countries, etc.
- Country or area numerical codes added or changed since 1982
- Script codes (for writing): ISO 15924: Codes for the representation of names of scripts (wikipedia)
- Each script is given both a four-letter code and a numeric one.
- Script is defined as "set of graphic characters used for the written form of one or more languages".
- One could differentiate, for example, between Serbian written in the Cyrillic (sr-Cyrl) or Latin (sr-Latn) script, or mark romanized text as such.
- ISO 15924: Code Lists
- Additions and Changes to ISO 15924 Codes
- BCP47 (BCP = Best Current Practice) introduces
- lang (required): ex: en, fr
- script (optional): ex: Hans
- regional variant (optional): ex: 419 for Latin America (to indicate Latin American Spanish)
- Understanding the New Language Tags
- This BCP is composed of two RFCs
- unicode.org: A General Method for Rendering Combining Marks
- unicode.org: Unicode Line Breaking Algorithm
- ICU Home
- Language Plural Rules
- Slides: IUC 36: Plural & Gender & More in Translated Messages
- Library of Congress >> Standards >> ISO 639.2 >> Codes for the representation of names of languages
- CLDR - Unicode Common Locale Data Repository
- Google Developers: Internationalization
- Why flags should not be used to indicate language choice
- Closure JS Library
- OS X
- Interesting Languages for test cases
- Catalan, Romanian, Russian
- Catalan is interesting for gender
- Russian is interesting for pluralization
- Misc
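
The casing recommendations listed above (lowercase language, title-case script, uppercase region, lowercase for everything else) can be sketched in a few lines of Python. This is only an illustration: the function name normalize_case is made up for this page, and the rule that everything after a single-character subtag stays lowercase is taken from RFC 5646 section 2.1.1 rather than from the list above. The sketch decides by position and length only and does not check that the tag is valid.

```python
def normalize_case(tag: str) -> str:
    """Apply the conventional casing to each subtag of a language tag.

    Hypothetical helper: it looks only at position and length,
    and never consults the IANA registry.
    """
    parts = []
    seen_singleton = False  # True once an 'x' or extension singleton was seen
    for index, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        is_alpha = lowered.isascii() and lowered.isalpha()
        if index == 0 or seen_singleton:
            # Language subtag, or anything after a singleton: lowercase.
            parts.append(lowered)
        elif len(lowered) == 4 and is_alpha:
            # Four letters: script, written in title case ('Cyrl').
            parts.append(lowered.title())
        elif len(lowered) == 2 and is_alpha:
            # Two letters: region, written in uppercase ('MN').
            parts.append(lowered.upper())
        else:
            # Numeric regions, variants, and all other subtags: lowercase.
            parts.append(lowered)
        if len(lowered) == 1:
            seen_singleton = True
    return "-".join(parts)


if __name__ == "__main__":
    for example in ("MN-cYRL-mn", "sr-latn-rs", "EN-ca-X-CA"):
        print(example, "->", normalize_case(example))
```

Casing carries no meaning in a language tag, so comparisons should be case-insensitive regardless; normalizing is only for display and for consistent storage.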
RFC 5646
Ref: RFC 5646: Tags for Identifying Languages (obsoletes RFC 4646)
Update: Looks like this is all nicely explained at Language tags in HTML and XML making most of the stuff below unnecessary. Why, oh why, is it so easy to find all those old docs and references but not something like this – the first useful and real reference to RFC 5646! Sigh.
2.2. Language Subtag Sources and Interpretation
- "Tag" refers to a complete language tag, such as "sr-Latn-RS" or "az-Arab-IR".
- "Subtag" refers to a specific section of a tag, delimited by a hyphen, such as the subtags 'zh', 'Hant', and 'CN' in the tag "zh-Hant-CN".
- "Code" refers to values defined in external standards (and that are used as subtags in this document).
- For example, 'Hant' is an ISO15924 script code that was used to define the 'Hant' script subtag for use in a language tag. Examples of codes in this document are enclosed in single quotes ('en', 'Hant').
Language tags are designed so that each subtag type has unique length and content restrictions. These make identification of the subtag's type possible, even if the content of the subtag itself is unrecognized. This allows tags to be parsed and processed without reference to the latest version of the underlying standards or the IANA registry and makes the associated exception handling when parsing tags simpler.
Refer to the RFC for the ABNF grammar, but note that it's not very helpful by itself. For example, the grammar allows an extended language subtag to be 1 to 3 alpha-3 codes separated by hyphens, and the text later mentions that only 1 is legal; the other variations are permanently reserved and specifying them will always be invalid. The general format is:
<PrimaryLanguage>[-<Script>][-<Region>]*(-<Variant>)[-<PrivateUse>]
- <PrimaryLanguage> is always the first subtag and is required.
  - It can contain a hyphen. It's composed of <MacroLanguage>-<ExtendedLanguage> where at least one of them must be specified.
  - <MacroLanguage> is either an alpha-2 code or, where an alpha-2 code doesn't exist, an alpha-3 code. Its value comes from the official IANA assignments list (look only for Type: language records).
  - <ExtendedLanguage> can only be an alpha-3 code. Again, look at the official IANA assignments list and filter for records with Type: extlang.
    - Note that the IANA records for an extended language are always required to correspond to exactly one MacroLanguage. This is implicit when parsed.
  - Parsing: How do you know when you're done parsing the PrimaryLanguage subtag (which can be one of <MacroLanguage>, <ExtendedLanguage>, or <MacroLanguage>-<ExtendedLanguage>)?
    - MacroLanguage is 2 or 3 characters and ExtendedLanguage is exactly 3; both are always ASCII letters. Specifically, they can't be numbers.
    - If the hyphenated piece after the first one isn't exactly 3 ASCII letters (letters, not digits), it's not part of the language subtag.
      - This is because the following piece can only be a Script, Region, Variant, or PrivateUse subtag. Script is always 4 characters when present. Region is either 2 letters or 3 digits. Variant subtags that begin with a letter are at least 5 characters long and those that begin with a digit are at least 4 characters long.
      - Also, PrivateUse and Extension subtags are always preceded by a single-character subtag, so seeing such a subtag also means the end of parsing something "useful" (not just when parsing the Language subtag). You don't need this additional rule since it is covered by the first two rules.
- Script is optional. When present, it's always a 4 letter code defined in ISO 15924: List of four-letter script codes. Filter the official IANA assignments list for records with Type: script.
- Region is optional. It identifies a country/region/geographic area. When present, it's always a 2 letter code or a 3 digit code. Filter the official IANA assignments list for records with Type: region. This is typically ISO 3166-1 Alpha 2 codes and numeric codes from UN M.49.
- Variant subtags are used to indicate additional, well-recognized variations that define a language or its dialects that are not covered by other available subtags. You can have 0 or more of these.
  - Those that begin with a letter are at least 5 in length and those beginning with a digit are at least 4 in length. Further, they must be distinct (ignoring case).
  - Filter the official IANA assignments list for records with Type: variant.
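
The length and content rules above are enough to split a tag into its parts without looking anything up in the registry. Here is a rough Python sketch of that idea. The names ParsedTag and parse_tag are invented for illustration, the upper bound of 8 characters for a variant comes from the RFC's ABNF rather than from the notes above, and the sketch deliberately ignores grandfathered tags and the reserved 4 to 8 letter primary language subtags. It checks shape only, not whether a subtag is actually registered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedTag:
    language: str
    extlang: Optional[str] = None
    script: Optional[str] = None
    region: Optional[str] = None
    variants: List[str] = field(default_factory=list)
    # Everything from the first single-character subtag on, left unparsed.
    rest: List[str] = field(default_factory=list)


def is_letters(piece: str, length: int) -> bool:
    return len(piece) == length and piece.isascii() and piece.isalpha()


def is_digits(piece: str, length: int) -> bool:
    return len(piece) == length and piece.isascii() and piece.isdigit()


def is_variant(piece: str) -> bool:
    if not (piece.isascii() and piece.isalnum()):
        return False
    minimum = 4 if piece[0].isdigit() else 5
    return minimum <= len(piece) <= 8


def parse_tag(tag: str) -> ParsedTag:
    """Classify subtags purely by position, length, and content."""
    pieces = tag.split("-")
    if not (is_letters(pieces[0], 2) or is_letters(pieces[0], 3)):
        raise ValueError(f"unsupported primary language subtag: {pieces[0]!r}")
    parsed = ParsedTag(language=pieces[0].lower())
    i = 1
    # Exactly 3 letters right after the language: extended language.
    if i < len(pieces) and is_letters(pieces[i], 3):
        parsed.extlang = pieces[i].lower()
        i += 1
    # Exactly 4 letters: script.
    if i < len(pieces) and is_letters(pieces[i], 4):
        parsed.script = pieces[i].title()
        i += 1
    # 2 letters or 3 digits: region.
    if i < len(pieces) and (is_letters(pieces[i], 2) or is_digits(pieces[i], 3)):
        parsed.region = pieces[i].upper()
        i += 1
    # Zero or more variants, which must be distinct ignoring case.
    while i < len(pieces) and is_variant(pieces[i]):
        variant = pieces[i].lower()
        if variant in parsed.variants:
            raise ValueError(f"duplicate variant subtag: {variant!r}")
        parsed.variants.append(variant)
        i += 1
    # Whatever remains must start with a single-character subtag.
    if i < len(pieces):
        if len(pieces[i]) != 1:
            raise ValueError(f"unexpected subtag: {pieces[i]!r}")
        parsed.rest = [piece.lower() for piece in pieces[i:]]
    return parsed


if __name__ == "__main__":
    print(parse_tag("zh-cmn-Hant"))
    print(parse_tag("de-CH-1996"))
```

For "zh-cmn-Hant" this gives language 'zh', extlang 'cmn', and script 'Hant'; for "es-419" the numeric piece lands in region because it is 3 digits rather than 3 letters.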
Macro Languages
Ref: ISO 639-3 and Macro Languages
(Note that that page is a bit outdated and talks about stuff that has already happened. Also, the macro language is not required; instead, variations such as zh-cmn and zh-cmn-Hant are marked redundant and the preferred tags are just cmn and cmn-Hant respectively.) (Aside: The redundant variations, along with optional preferred versions, can be obtained by filtering records in the official IANA assignments list for Type: redundant. For example, many sign languages.)
ISO 639 has occasionally assigned codes to "macro-languages", which are language families that contain a number of recognizably related (but not necessarily mutually intelligible) languages. A good example of this is Chinese.
The ISO 639-1 code 'zh' identifies "Chinese", but the concept of Chinese encloses a number of distinct languages or dialects that share certain traits. While these languages are written very similarly, spoken content is very different indeed. The available regional options are poor proxies for the spoken dialects (many of which are confined to mainland China).
Mandarin Chinese (a spoken variation) is identified by the ISO 639-3 code cmn.
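
Swapping a redundant tag for its preferred form, as described in the note above, amounts to a case-insensitive table lookup. A minimal Python sketch follows, assuming a hand-copied two-entry table holding just the zh-cmn examples mentioned here; the names PREFERRED_TAGS and preferred_tag are hypothetical, and a real implementation would build the table from the Type: redundant records in the IANA assignments list.

```python
# Tiny hand-copied excerpt for illustration only, keyed by lowercased tag.
# A real table would be generated from the registry's 'Type: redundant'
# records that carry a preferred value.
PREFERRED_TAGS = {
    "zh-cmn": "cmn",
    "zh-cmn-hant": "cmn-Hant",
}


def preferred_tag(tag: str) -> str:
    """Return the preferred form of a whole tag, or the tag unchanged."""
    return PREFERRED_TAGS.get(tag.lower(), tag)


if __name__ == "__main__":
    for example in ("zh-cmn-Hant", "ZH-CMN", "en-US"):
        print(example, "->", preferred_tag(example))
```

The lookup matches whole tags only; it does not try to rewrite a longer tag that merely starts with a redundant one.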
More from RFC 5646
- Some of the subtags in the IANA registry do not come from an underlying standard. These can only appear in specific positions in a tag—they can only occur as primary language subtags or as variant subtags.
- Sequences of private use and extension subtags MUST occur at the end of the sequence of subtags and MUST NOT be interspersed with subtags defined elsewhere in this document. These sequences are introduced by single-character subtags, which are reserved as follows:
- The single-letter subtag 'x' introduces a sequence of private use subtags.
- The single-letter subtag 'i' is used by some grandfathered tags, such as "i-default", where it always appears in the first position and cannot be confused with an extension.
- All other single-letter and single-digit subtags are reserved to introduce standardized extension subtag sequences as described in Section 3.7: Extensions and the Extensions Registry
- Extended Language Subtags
- Extended language subtags consist solely of three-letter subtags.
- All extended language subtag records defined in the registry were defined according to the assignments found in ISO639-3.
- Language collections and groupings, such as defined in ISO639-5, are specifically excluded from being extended language subtags.
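
The single-character subtag rules quoted above ('x' for private use, 'i' only in first position for grandfathered tags, every other singleton for an extension) can be sketched as a small splitter in Python. The function name split_singletons and its return shape are made up for illustration, and the 'u' singleton in the example is only a sample input. The sketch does not validate anything, for example it does not reject a singleton that appears twice.

```python
from typing import Dict, List, Tuple


def split_singletons(tag: str) -> Tuple[List[str], Dict[str, List[str]], List[str]]:
    """Split a tag into (main subtags, extensions by singleton, private use).

    Hypothetical helper: lowercases everything and performs no validation.
    """
    subtags = tag.lower().split("-")
    main: List[str] = []
    extensions: Dict[str, List[str]] = {}
    private_use: List[str] = []
    if subtags[0] == "i":
        # Grandfathered tags such as "i-default" are kept whole.
        return subtags, extensions, private_use
    current = main
    for position, subtag in enumerate(subtags):
        if subtag == "x":
            # Everything after 'x' is private use, up to the end of the tag.
            private_use = subtags[position + 1:]
            break
        if len(subtag) == 1:
            # Any other singleton starts a new extension sequence.
            current = extensions.setdefault(subtag, [])
        else:
            current.append(subtag)
    return main, extensions, private_use


if __name__ == "__main__":
    print(split_singletons("en-US-u-co-phonebk-x-foo"))
    print(split_singletons("i-default"))
```

Because these sequences must come at the end of the tag, the main subtags can be cut off first and handed to whatever handles language, script, region, and variant.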