Links
- Unicode in Python
- stackoverflow.com: How to use list of python objects whose representation is unicode
- How to Use UTF-8 with Python
- Unicode for python identifiers
- Supported in python 3.
- See PEP 3131: Supporting Non-ASCII Identifiers
- Allowed characters – not exhaustive
- More About Unicode in Python 2 and 3 | Armin Ronacher
- Bugs
- bug 4947 and bugfix: sys.stdout fails to use default encoding as advertised.
- Fixed in Python 2.7 but not backported to Python 2.6.
- The bug:
print >>my_file, my_unicode   # <- is encoded with my_file.encoding
my_file.write(my_unicode)     # <- is encoded with my_file.encoding
print my_unicode              # <- works: is encoded with sys.stdout.encoding
sys.stdout.write(my_unicode)  # <- the bug: is encoded with sys.getdefaultencoding()
- sys.stdout
- Even if your terminal is UTF-8 and things magically appear to work, they may break when you're piping the output.
- Under Python 2, treat stdin and stdout as byte streams.
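One way to follow the byte-stream advice is to encode text explicitly before it reaches the stream. The following is a minimal sketch in Python 3 syntax, not a definitive recipe: `write_text` is a hypothetical helper, and an `io.BytesIO` stands in for a piped stdout (a real Python 3 script would pass `sys.stdout.buffer`).

```python
import io


def write_text(byte_stream, text, encoding="utf-8"):
    """Encode text ourselves, then write raw bytes to the stream.

    Hypothetical helper: the caller chooses the encoding instead of it
    being guessed from the terminal, so piping cannot change the result.
    """
    byte_stream.write(text.encode(encoding))


# io.BytesIO stands in for a piped stdout here.
pipe = io.BytesIO()
write_text(pipe, u"\u0394 = delta\n")
piped_bytes = pipe.getvalue()
```

Because the encoding is fixed at the call site, the bytes are the same whether the output goes to a terminal, a pipe, or a file.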
Notes
Specify the encoding of files
Ref: PEP 263: Defining Python Source Code Encodings
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
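To check which encoding a declaration like the one above resolves to, the standard library's `tokenize.detect_encoding()` can be pointed at the source bytes. This is a small sketch in Python 3; `declared_encoding` is a hypothetical wrapper name and the source bytes are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source.

    Hypothetical wrapper around tokenize.detect_encoding(), which looks
    for a PEP 263 declaration in the first two lines of the file.
    """
    encoding, _lines_read = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


# A shebang line followed by the declaration, as in the note above.
utf8_source = b"#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\nname = 'x'\n"
print(declared_encoding(utf8_source))
```

The returned name is normalized, so `UTF-8` in the comment comes back as `utf-8`.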
Specifying unicode strings
>>> u"\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
u'\u0394'
>>> u"\u0394"  # Using a 16-bit hex value
u'\u0394'
>>> u"\U00000394"  # Using a 32-bit hex value
u'\u0394'
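The three spellings denote the same code point, which can be checked directly. A short sketch in Python 3 syntax, where every string literal is Unicode and the `u` prefix is optional; the variable names are only illustrative.

```python
import unicodedata

# Three ways to write U+0394 in a string literal.
by_name = "\N{GREEK CAPITAL LETTER DELTA}"
by_16_bit = "\u0394"
by_32_bit = "\U00000394"

# All three literals produce the same one-character string.
same = by_name == by_16_bit == by_32_bit
print(same, hex(ord(by_name)), unicodedata.name(by_name))
```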
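Convert a unicode object that actually holds raw bytes back to a byte str (Python 2).
# `ISO-8859-1` aka `Latin-1` is the only encoding whose
# 256 characters are identical to the first 256 characters of
# Unicode, so encoding with it maps each code point back to the same byte value.
import codecs
# latin_1_encode() returns a (bytes, length) tuple; keep only the bytes.
str_string, _ = codecs.latin_1_encode(unicode_string_which_is_actually_binary)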
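The Latin-1 property described above can be verified with a round trip over every byte value. A minimal sketch in Python 3, using `bytes` and `str` where Python 2 would use `str` and `unicode`:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding never fails...
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]

# ...and encoding the result gives back the original bytes unchanged.
round_tripped = as_text.encode("latin-1")
print(round_tripped == all_bytes)
```

This is why Latin-1 works as a lossless carrier for binary data that has ended up in a text type.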
The various encodings
import locale
import sys
locale.getpreferredencoding()
sys.getfilesystemencoding()
sys.stdin.encoding / sys.stdout.encoding / sys.stderr.encoding
Opening files.
import io
infile = io.open('UTF-8.txt', 'rt', encoding='UTF-8')

import codecs
# codecs.open() always opens the underlying file in binary mode, so omit 't'
codecs.open('UTF-8.txt', 'r', encoding='UTF-8')

# python 3
open(filename, 'r', encoding='UTF-8')
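A round trip through a scratch file shows what the explicit `encoding` argument buys. This is a sketch only: the temporary directory and sample text are made up for illustration, and `io.open` is used because it behaves the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

text = u"\u0394 is the Greek capital letter delta\n"

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "UTF-8.txt")

# Write with an explicit encoding...
with io.open(path, "w", encoding="UTF-8") as outfile:
    outfile.write(text)

# ...and read back with the same one, so decoding does not depend on the locale.
with io.open(path, "r", encoding="UTF-8") as infile:
    read_back = infile.read()

# On disk the file holds UTF-8 bytes: U+0394 is stored as two bytes.
with io.open(path, "rb") as raw_file:
    raw_prefix = raw_file.read(2)
```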
sys.stdout's encoding
import codecs
import sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
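What the wrapper does can be seen without touching the real `sys.stdout`. In this sketch an `io.BytesIO` stands in for a byte-oriented stdout (Python 2's `sys.stdout`, or `sys.stdout.buffer` on Python 3); that stand-in is an assumption made for illustration.

```python
import codecs
import io

# io.BytesIO stands in for a byte-oriented stdout.
byte_stream = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class; instantiating it wraps
# the byte stream so that text written to it is encoded as UTF-8.
writer = codecs.getwriter("UTF-8")(byte_stream)
writer.write(u"\u0394\n")

encoded = byte_stream.getvalue()
```

The underlying stream only ever receives bytes; the encoding step lives entirely in the wrapper.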
sys.stdin's encoding
For Python 3, refer to the sys.stdin docs.
Refer:
- Python 2: Why do I have to do:
sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin)
- Or
sys.stdin = codecs.getreader(locale.getpreferredencoding())(sys.stdin)
- This is because reads do not perform any conversion: you get bytes. The encoding attribute has no effect on reads; it only affects writes.
- Python 3: How to specify stdin encoding
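The Python 2 behaviour described above, raw reads returning bytes until the stream is wrapped in a reader, can be sketched as follows. An `io.BytesIO` stands in for a byte-oriented stdin; the sample bytes are made up for illustration.

```python
import codecs
import io

# io.BytesIO stands in for a byte-oriented stdin carrying UTF-8 data.
byte_stream = io.BytesIO(b"\xce\x94\n")

# Reading the raw stream yields undecoded bytes.
raw = byte_stream.read()
byte_stream.seek(0)

# Wrapping it in a StreamReader makes reads return decoded text instead.
reader = codecs.getreader("UTF-8")(byte_stream)
decoded = reader.read()
```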
Python 3
Python 3 does not expect ASCII from sys.stdin. It opens stdin in text mode and makes an educated guess as to which encoding is used. That guess may come down to ASCII, but that is not a given. See the sys.stdin documentation.
Like other file objects opened in text mode, the sys.stdin object derives from the io.TextIOBase base class; it has a .buffer attribute pointing to the underlying buffered IO instance (which in turn has a .raw attribute).
Wrap the sys.stdin.buffer attribute in a new io.TextIOWrapper() instance to specify a different encoding:
import io
import sys
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
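The same wrapping can be tried without a real stdin. In this sketch an `io.BytesIO` stands in for `sys.stdin.buffer`, and Latin-1 is picked only to show an encoding other than UTF-8; both are assumptions made for illustration.

```python
import io

# io.BytesIO stands in for sys.stdin.buffer, here carrying Latin-1 bytes.
fake_stdin_buffer = io.BytesIO(b"caf\xe9\n")

# Wrap the buffered byte stream and name the encoding explicitly.
input_stream = io.TextIOWrapper(fake_stdin_buffer, encoding="latin-1")

# Reads now return str, decoded with the chosen encoding.
line = input_stream.readline()
```

Decoding the same bytes as UTF-8 would fail, since a lone `0xE9` byte is not valid UTF-8, which is why the encoding has to match what the producer actually wrote.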