Strings Bytes versus Unicode

suggest change

In Python 2 there are two variants of string: those made of bytes with type (str) and those made of text with type (unicode).

In Python 2, an object of type str is always a byte sequence, but is commonly used for both text and binary data.

A string literal is interpreted as a byte string.

s = 'Cafe'    # type(s) == str

There are two exceptions: You can define a Unicode (text) literal explicitly by prefixing the literal with u:

s = u'Café'   # type(s) == unicode
b = 'Lorem ipsum'  # type(b) == str

Alternatively, you can specify that a whole module’s string literals should create Unicode (text) literals:

from __future__ import unicode_literals

s = 'Café'   # type(s) == unicode
b = 'Lorem ipsum'  # type(b) == unicode

In order to check whether your variable is a string (either Unicode or a byte string), you can use:

isinstance(s, basestring)

In Python 3, the str type is a Unicode text type.

s = 'Cafe'           # type(s) == str
s = 'Café'           # type(s) == str (note the accented trailing e)

Additionally, Python 3 added a bytes object, suitable for binary “blobs” or writing to encoding-independent files. To create a bytes object, you can prefix b to a string literal or call the string’s encode method:

# Or, if you really need a byte string:
s = b'Cafe'          # type(s) == bytes
s = 'Café'.encode()  # type(s) == bytes

To test whether a value is a string, use:

isinstance(s, str)

It is also possible to prefix string literals with a u prefix to ease compatibility between Python 2 and Python 3 code bases. Since, in Python 3, all strings are Unicode by default, prepending a string literal with u has no effect:

u'Cafe' == 'Cafe'

Python 2’s raw Unicode string prefix ur is not supported, however:

>>> ur'Café'
  File "<stdin>", line 1
    ur'Café'
           ^
SyntaxError: invalid syntax

Note that you must encode a Python 3 text (str) object to convert it into a bytes representation of that text. The default encoding of this method is UTF-8.

You can use decode to ask a bytes object for what Unicode text it represents:

>>> b.decode()
'Café'

While the bytes type exists in both Python 2 and 3, the unicode type only exists in Python 2. To use Python 3’s implicit Unicode strings in Python 2, add the following to the top of your code file:

from __future__ import unicode_literals
print(repr("hi"))
# u'hi'

Another important difference is that indexing bytes in Python 3 results in an int output like so:

b"abc"[0] == 97

Whilst slicing in a size of one results in a length 1 bytes object:

b"abc"[0:1] == b"a"

In addition, Python 3 fixes some unusual behavior with unicode, i.e. reversing byte strings in Python 2. For example, the following issue is resolved:

# -*- coding: utf8 -*-
print("Hi, my name is Łukasz Langa.")
print(u"Hi, my name is Łukasz Langa."[::-1])
print("Hi, my name is Łukasz Langa."[::-1])

# Output in Python 2
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsaku�� si eman ym ,iH

# Output in Python 3
# Hi, my name is Łukasz Langa.
# .agnaL zsakuŁ si eman ym ,iH
# .agnaL zsakuŁ si eman ym ,iH

Feedback about page:

Feedback:
Optional: your email if you want me to get back to you:


Incompatibilities moving from Python 2 to Python 3:
* Strings Bytes versus Unicode
* map

Table Of Contents
2 Filter
3 List
7 Loops
22 Reduce
27 Classes
31 Set
39 Incompatibilities moving from Python 2 to Python 3
42 Tuple
45 Enum
62 Sockets
89 urllib
92 Idioms
104 Stack
105 Profiling
109 Logging
111 os module
118 Mixins
120 ArcPy
126 Arrays
132 2to3 tool
135 Unicode
138 Neo4j
140 Curses
141 Templates
145 heapq
146 tkinter
154 Audio
155 pyglet
157 ijson
160 Flask
161 Groupby
163 pygame
165 hashlib
166 Gzip
167 ctypes
185 pyaudio
186 shelve