unicode

Links

Unicode in Python
stackoverflow.com: How to use list of python objects whose representation is unicode
How to Use UTF-8 with Python
Unicode for python identifiers
- Supported in Python 3.
- See PEP 3131: Supporting Non-ASCII Identifiers
- Allowed characters – not exhaustive
More About Unicode in Python 2 and 3 | Armin Ronacher
Bugs
- bug 4947 and bugfix
  - sys.stdout fails to use default encoding as advertised.
  - Fixed in Python 2.7 but not backported to Python 2.6.
  - The bug:
    print >>my_file, my_unicode # <- is encoded with my_file.encoding
    my_file.write(my_unicode) # <- is encoded with my_file.encoding
    
    print my_unicode # <- works: is encoded with sys.stdout.encoding
    sys.stdout.write(my_unicode) # <- is encoded with sys.getdefaultencoding()
sys.stdout
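- Even if your terminal is UTF-8 and printing appears to work, it may break when the output is piped: under Python 2, sys.stdout.encoding is None when stdout is not a terminal, so unicode output falls back to the ASCII default.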
- Under Python 2, treat stdin and stdout as byte streams.
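
One way to follow the byte-stream advice is to encode explicitly before writing, so the result does not depend on what the stream reports as its encoding. The following is a minimal sketch; `write_text` and `fake_stdout` are hypothetical names, and `io.BytesIO` stands in for a byte-oriented stdout (`sys.stdout` in Python 2, `sys.stdout.buffer` in Python 3).

```python
import io


def write_text(byte_stream, text, encoding="utf-8"):
    """Encode text explicitly, then write the resulting bytes."""
    data = text.encode(encoding)
    byte_stream.write(data)
    return len(data)


# io.BytesIO stands in for a real byte-oriented stdout in this sketch.
fake_stdout = io.BytesIO()
written = write_text(fake_stdout, u"\u0394x\n")
```

Because the encoding is chosen by the caller, the same bytes are produced whether the output goes to a terminal or a pipe.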

Notes

Specify the encoding of files

Ref: PEP 263: Defining Python Source Code Encodings

#!/usr/bin/env python  
# -*- coding: UTF-8 -*-
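
The effect of the declaration can be seen by compiling source bytes directly. This is only a sketch, assuming Python 3; the variable names and the `<pep263-demo>` filename are made up for illustration. The source is stored as Latin-1 bytes, and the coding line tells the compiler how to decode them.

```python
# Source bytes in Latin-1: 0xE9 is "e with acute accent" in that encoding.
source = b"# -*- coding: latin-1 -*-\nword = 'caf\xe9'\n"

namespace = {}
# compile() accepts bytes and honours the PEP 263 coding declaration.
exec(compile(source, "<pep263-demo>", "exec"), namespace)
word = namespace["word"]
```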

Specifying unicode strings

>>> u"\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name; evaluates to u'\u0394'
>>> u"\u0394"                          # Using a 16-bit hex value; evaluates to u'\u0394'
>>> u"\U00000394"                      # Using a 32-bit hex value; evaluates to u'\u0394'
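
A quick check that the three spellings denote the same character, written as a small sketch for Python 3 (where the `u` prefix is optional). The variable names are illustrative only.

```python
import unicodedata

by_name = "\N{GREEK CAPITAL LETTER DELTA}"
by_16_bit = "\u0394"
by_32_bit = "\U00000394"

# All three literals produce the same one-character string.
all_equal = by_name == by_16_bit == by_32_bit
code_point = ord(by_name)            # integer code point of the character
name = unicodedata.name(by_16_bit)   # official Unicode character name
```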

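Convert a unicode string that actually holds binary data back to a byte string (str in Python 2).

# `ISO-8859-1` aka `Latin-1` is the only encoding whose
# 256 characters are identical to the first 256 characters of
# Unicode, so encoding with it maps each code point back to the same byte value.
import codecs
# codecs.latin_1_encode returns a (bytes, length) tuple; keep only the bytes.
str_string = codecs.latin_1_encode(unicode_string_which_is_actually_binary)[0]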
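
The Latin-1 property can be checked with a round trip over every possible byte value. A minimal sketch with illustrative variable names:

```python
# Every byte value from 0x00 to 0xFF, in order.
all_bytes = bytes(bytearray(range(256)))

# Decoding as Latin-1 maps byte N to code point N.
as_text = all_bytes.decode("latin-1")
code_points_match = [ord(ch) for ch in as_text] == list(range(256))

# Encoding back as Latin-1 recovers the original bytes unchanged.
round_tripped = as_text.encode("latin-1")
```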

The various encodings

locale.getpreferredencoding()
sys.getfilesystemencoding()
sys.stdin.encoding / sys.stdout.encoding / sys.stderr.encoding
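
To see how these values differ on a given machine, they can be collected in one place. This is a small sketch; `report_encodings` is a hypothetical helper, and `getattr` is used because a replaced or missing stream may not have an `encoding` attribute.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        "preferred": locale.getpreferredencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "default": sys.getdefaultencoding(),
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
    }


for label, value in sorted(report_encodings().items()):
    print("{0:<10} {1}".format(label, value))
```

Running it once in a terminal and once with the output piped is a simple way to spot values that change between the two.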

Opening files.

import io
infile = io.open('UTF-8.txt', 'rt', encoding='UTF-8')

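import codecs
# When an encoding is given, codecs.open opens the underlying file in binary mode, so pass 'r' rather than 'rt'.
infile = codecs.open('UTF-8.txt', 'r', encoding='UTF-8')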

# python 3
open(filename, 'r', encoding='UTF-8')
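
A round trip through a temporary file shows what the `encoding` argument does. This sketch uses `io.open`, which behaves the same in Python 2.6+ and Python 3; the temporary path is created only for the demonstration.

```python
import io
import os
import tempfile

text = u"\u0394 delta"
path = os.path.join(tempfile.mkdtemp(), "UTF-8.txt")

# Text mode with an explicit encoding: unicode in, UTF-8 bytes on disk.
with io.open(path, "w", encoding="UTF-8") as outfile:
    outfile.write(text)

# Reading with the same encoding gives the original text back.
with io.open(path, "r", encoding="UTF-8") as infile:
    read_back = infile.read()

# Binary mode shows the bytes that were actually stored.
with io.open(path, "rb") as rawfile:
    raw_bytes = rawfile.read()
```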

`sys.stdout`'s encoding

# Python 2 only: sys.stdout is a byte stream there, so wrap it in a UTF-8 writer.
import codecs
import sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
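
In Python 3, `sys.stdout` is already a text stream, so the `codecs.getwriter` idiom does not carry over directly. One option is to wrap the underlying byte buffer in an `io.TextIOWrapper`. The sketch below uses an `io.BytesIO` as a stand-in for `sys.stdout.buffer` so the encoded bytes can be inspected; `byte_sink` and `utf8_out` are illustrative names.

```python
import io

# Stand-in for sys.stdout.buffer in this sketch.
byte_sink = io.BytesIO()

# newline="\n" keeps line endings untranslated on every platform.
utf8_out = io.TextIOWrapper(byte_sink, encoding="UTF-8", newline="\n")
utf8_out.write(u"\u0394\n")
utf8_out.flush()

encoded = byte_sink.getvalue()
```

With a real process, the same wrapper would be built around `sys.stdout.buffer` and assigned to `sys.stdout`.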

`sys.stdin`'s encoding

For Python 3, see the sys.stdin docs.

See also:

Python 2: Why do I have to do: sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin)
- Or sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin)
- This is needed because reads perform no conversion: you get bytes. The encoding attribute has no effect on reads; it only affects writes.
Python 3: How to specify stdin encoding
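
The `codecs.getreader` wrapping can be tried without touching the real stdin. In this sketch an `io.BytesIO` plays the role of a byte-oriented `sys.stdin` as found in Python 2; `raw_stdin` and `reader` are hypothetical names.

```python
import codecs
import io

# Stand-in for a byte-oriented stdin that delivers UTF-8 encoded input.
raw_stdin = io.BytesIO(u"\u0394 first\nsecond\n".encode("utf-8"))

# The reader decodes bytes from the wrapped stream into unicode text.
reader = codecs.getreader("utf-8")(raw_stdin)
lines = reader.readlines()
```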

Python 3

Python 3 does not expect ASCII from sys.stdin. It'll open stdin in text mode and make an educated guess as to what encoding is used. That guess may come down to ASCII, but that is not a given. See the sys.stdin documentation.

Like other file objects opened in text mode, the sys.stdin object derives from the io.TextIOBase base class; it has a .buffer attribute pointing to the underlying buffered IO instance (which in turn has a .raw attribute).

Wrap the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding:

import io
import sys

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
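
To illustrate why the chosen encoding matters, the same bytes can be decoded through two wrappers. This is a sketch only: `io.BytesIO` stands in for `sys.stdin.buffer`, and the variable names are made up.

```python
import io

payload = u"\u0394".encode("utf-8")  # the two bytes 0xCE 0x94

# Decoded with the encoding that was actually used: one character.
as_utf8 = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8").read()

# Decoded with a wrong guess: two unrelated characters, and no error raised.
as_latin1 = io.TextIOWrapper(io.BytesIO(payload), encoding="latin-1").read()
```

A wrong guess does not necessarily fail loudly, which is a reason to state the encoding explicitly when it is known.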