Conversion between str or bytes data and unicode characters


The contents of files and network messages are bytes that may represent encoded characters. They often need to be decoded to Unicode for proper display.

In Python 2, you may need to convert str data to Unicode characters. A plain string literal ('', "", etc.) produces a str, which is a sequence of bytes; the interpreter displays any bytes outside the ASCII range as escaped values. Unicode string literals are written u'' (or u"", etc.) and produce unicode objects.

# You get "© abc" encoded in UTF-8 from a file, network, or other data source

s = '\xc2\xa9 abc'  # s is a sequence of bytes, not a string of characters
                    # It carries no record that the original was UTF-8
                    # This is the default form of string literals in Python 2
s[0]                # '\xc2' - meaningless byte (without context such as an encoding)
type(s)             # str - even though it is not useful as text without a known encoding

u = s.decode('utf-8')  # u'\xa9 abc'
                       # Now we have a Unicode string, which prints properly on a Unicode terminal
                       # In Python 2, Unicode string literals need a leading u
                       # str.decode interprets the bytes using the given encoding and returns a unicode object
u[0]                # u'\xa9' - Unicode Character 'COPYRIGHT SIGN' (U+00A9) '©'
type(u)             # unicode

u.encode('utf-8')   # '\xc2\xa9 abc'
                    # unicode.encode returns a str of bytes; non-ASCII bytes are displayed as escapes

In Python 3 you may need to convert sequences of bytes (bytes objects, written as byte literals) to strings of Unicode characters. The default string literal is now a Unicode string (str), and byte literals must be entered as b'', b"", etc. A bytes object returns True for isinstance(some_val, bytes), which lets you check whether some_val is still encoded and needs decoding.
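The isinstance check can be wrapped in a small helper that decodes only when needed. This is a minimal sketch for Python 3: the name to_text and the UTF-8 default are assumptions made for illustration, not part of the standard library.

```python
def to_text(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is a bytes object.

    to_text is a hypothetical helper. The encoding must match how the
    bytes were produced; UTF-8 is only an assumed default.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


decoded = to_text(b'\xc2\xa9 abc')   # bytes are decoded to a str first
unchanged = to_text('already text')  # a str is returned as-is
```

A helper like this is handy at the boundary of a program (file or socket reads), so the rest of the code can work with str only.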

# You get "© abc" encoded in UTF-8 from a file or network

s = b'\xc2\xa9 abc' # s is a bytes object, not characters
                    # In Python 3, the default string literal is Unicode; bytes literals need a leading b
s[0]                # 194 (0xc2) - indexing bytes returns an int; s[0:1] gives b'\xc2'
type(s)             # bytes - now that byte sequences are explicit, Python can show that.

u = s.decode('utf-8')  # '© abc' on a Unicode terminal
                       # bytes.decode converts a bytes object to a string (which will, in Python 3, be Unicode)
u[0]                # '©' - Unicode Character 'COPYRIGHT SIGN' (U+00A9), also writable as '\u00a9'
type(u)             # str
                    # A Python 3 str is a sequence of Unicode code points, independent of any encoding

u.encode('utf-8')   # b'\xc2\xa9 abc'
                    # str.encode produces a bytes object, whose repr shows ASCII-range bytes as unescaped characters.
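Decoding only works when the bytes are valid in the chosen encoding. The sketch below, for Python 3, shows a round trip, a decode with the wrong encoding, and the standard errors argument of bytes.decode. The variable names and sample bytes are illustrative assumptions.

```python
data = b'\xc2\xa9 abc'               # UTF-8 bytes for "© abc"

text = data.decode('utf-8')          # bytes -> str
assert text.encode('utf-8') == data  # str -> bytes round trip

# Decoding with the wrong encoding may succeed but give the wrong characters.
wrong = data.decode('latin-1')       # 'Â© abc' - no error, but not the intended text

# Invalid UTF-8 raises UnicodeDecodeError unless an error handler is given.
bad = b'\xa9 abc'                    # a lone continuation byte is not valid UTF-8
try:
    bad.decode('utf-8')
except UnicodeDecodeError as exc:
    reason = exc.reason              # short description of what went wrong

# errors='replace' substitutes U+FFFD for each undecodable byte.
replaced = bad.decode('utf-8', errors='replace')  # '\ufffd abc'
```

Whether to fail loudly or use errors='replace' depends on the data source; replacing loses information, so it is best kept for display purposes.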


