Part 2 Parsing Tokenized Input with Yacc

This section explains how the tokenized input from Part 1 is processed using a context-free grammar (CFG). You specify the grammar, and the tokens are then processed according to it. Under the hood, PLY's yacc module builds an LALR parser from the grammar.

# Yacc example

import ply.yacc as yacc

# Get the token map from the lexer. This is required.
from calclex import tokens

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

# Error rule for syntax errors
def p_error(p):
    print("Syntax error in input!")

# Build the parser
parser = yacc.yacc()

while True:
    try:
        s = input('calc > ')  # use raw_input() on Python 2
    except EOFError:
        break
    if not s:
        continue
    result = parser.parse(s)
    print(result)
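
The example above imports its tokens from the calclex module of Part 1, which is not reproduced here. As a rough sketch of how the two halves fit together, the following single-file version pairs the same grammar with a small stand-in lexer. The lexer rules and the evaluate helper are assumptions made for illustration, and debug=False with write_tables=False (accepted by PLY 3.x) is passed only so that no parser.out or parsetab.py files are written.

```python
import ply.lex as lex
import ply.yacc as yacc

# --- Lexer: an assumed stand-in for the calclex module from Part 1 ---
tokens = ('NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN')

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()

# --- Parser: the same grammar as in the example above ---
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]

def p_expression_term(p):
    'expression : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES factor'
    p[0] = p[1] * p[3]

def p_term_div(p):
    'term : term DIVIDE factor'
    p[0] = p[1] / p[3]

def p_term_factor(p):
    'term : factor'
    p[0] = p[1]

def p_factor_num(p):
    'factor : NUMBER'
    p[0] = p[1]

def p_factor_expr(p):
    'factor : LPAREN expression RPAREN'
    p[0] = p[2]

def p_error(p):
    print("Syntax error in input!")

parser = yacc.yacc(debug=False, write_tables=False)

def evaluate(text):
    """Hypothetical helper: parse one expression and return its value."""
    return parser.parse(text, lexer=lexer)

print(evaluate("2 + 3 * 4"))
print(evaluate("(2 + 3) * 4"))
```

Because term sits below expression in the grammar, multiplication binds tighter than addition without any precedence declaration.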

Breakdown

Each grammar rule is defined by a function whose docstring contains the corresponding context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument p, a sequence containing the values of the grammar symbols in the corresponding rule. The values of p[i] are mapped to grammar symbols as shown here:

def p_expression_plus(p):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  p[0]         p[1]     p[2] p[3]

    p[0] = p[1] + p[3]

For tokens, the "value" of the corresponding p[i] is the same as the t.value attribute assigned in the lexer module. So PLUS has the value '+' (the matched text, not the regular expression \+).
For non-terminals, the value is whatever the rule's function places in p[0]. If nothing is placed there, the value is None. Also, p[-1] is not the same as p[3], because p is not a simple list: negative indices refer to symbols on the parsing stack to the left of the current rule, which is used with embedded actions (not discussed here).
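
To make the two cases concrete, here is a minimal sketch, assuming PLY 3.x is installed. The tiny grammar (pair, left, right) and the inline lexer are made up for illustration. One rule copies its token value into p[0], the other deliberately assigns nothing.

```python
import ply.lex as lex
import ply.yacc as yacc

# Assumed minimal lexer, defined inline so the example is self-contained.
tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)   # this is what the parser sees as p[i] for NUMBER
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

def p_pair(p):
    'pair : left PLUS right'
    # Collect the values of the three right-hand-side symbols.
    p[0] = (p[1], p[2], p[3])

def p_left(p):
    'left : NUMBER'
    p[0] = p[1]              # propagate the token value upward

def p_right(p):
    'right : NUMBER'
    pass                     # p[0] is never assigned, so the value stays None

def p_error(p):
    print("Syntax error in input!")

# debug=False and write_tables=False only keep PLY from writing files.
parser = yacc.yacc(debug=False, write_tables=False)

result = parser.parse("1 + 2", lexer=lexer)
print(result)  # (1, '+', None)
```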

Note that the function can have any name, as long as it is prefixed with p_.

The p_error(p) rule is defined to catch syntax errors (same as yyerror in yacc/bison).
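
As a sketch of a slightly more informative error rule, assuming PLY 3.x: the argument passed to p_error is the offending token, or None when the input ends unexpectedly. The errors list and the check helper are hypothetical names introduced for this example.

```python
import ply.lex as lex
import ply.yacc as yacc

# Assumed minimal lexer, defined inline so the example is self-contained.
tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

errors = []  # hypothetical collector for error messages

def p_expression_plus(p):
    'expression : expression PLUS NUMBER'
    p[0] = p[1] + p[3]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_error(p):
    if p is None:
        # The parser ran out of input in the middle of a rule.
        errors.append("unexpected end of input")
    else:
        # p is the token that could not be handled.
        errors.append("unexpected %s (%r) at position %d"
                      % (p.type, p.value, p.lexpos))

parser = yacc.yacc(debug=False, write_tables=False)

def check(text):
    """Hypothetical helper: parse text and return the errors reported."""
    del errors[:]
    parser.parse(text, lexer=lexer)
    return list(errors)

print(check("1 + 2"))
print(check("1 +"))
print(check("1 + + 2"))
```

The value returned by parse after an error depends on how the parser recovers, so this sketch inspects the collected messages rather than the result.
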
Multiple grammar rules can be combined into a single function, which is a good idea if productions have a similar structure.

def p_binary_operators(p):
    '''expression : expression PLUS term
                  | expression MINUS term
       term       : term TIMES factor
                  | term DIVIDE factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

Character literals can be used instead of tokens.

def p_binary_operators(p):
    '''expression : expression '+' term
                  | expression '-' term
       term       : term '*' factor
                  | term '/' factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

The literals must also be declared in the lexer module, using its literals variable.
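
A minimal sketch of both halves together, assuming PLY 3.x: the lexer declares the single-character literals, and the grammar refers to them in quotes. The evaluate helper and the decision to fold each non-terminal into one function are choices made for this example only.

```python
import ply.lex as lex
import ply.yacc as yacc

# Only NUMBER is a named token; the operators are declared as literals.
tokens = ('NUMBER',)
literals = ['+', '-', '*', '/', '(', ')']
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

def p_expression(p):
    """expression : expression '+' term
                  | expression '-' term
                  | term"""
    if len(p) == 2:          # expression : term
        p[0] = p[1]
    elif p[2] == '+':
        p[0] = p[1] + p[3]
    else:
        p[0] = p[1] - p[3]

def p_term(p):
    """term : term '*' factor
            | term '/' factor
            | factor"""
    if len(p) == 2:          # term : factor
        p[0] = p[1]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    else:
        p[0] = p[1] / p[3]

def p_factor(p):
    """factor : NUMBER
              | '(' expression ')'"""
    p[0] = p[1] if len(p) == 2 else p[2]

def p_error(p):
    print("Syntax error in input!")

parser = yacc.yacc(debug=False, write_tables=False)

def evaluate(text):
    """Hypothetical helper: parse one expression and return its value."""
    return parser.parse(text, lexer=lexer)

print(evaluate("2 * (3 + 4)"))
```

For a literal, both the token type and the token value are the character itself, which is why the comparisons against p[2] work unchanged.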

Empty productions have the form '''symbol : ''', with nothing after the colon.
By default, the first rule defined determines the start symbol. To set it explicitly, use start = 'foo', where foo is some non-terminal.
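
Both features can be sketched in one small grammar, assuming PLY 3.x. The grammar below (a possibly empty, space-separated list of numbers) is invented for illustration. Because the empty rule is defined first, the start variable is needed to make numbers the start symbol.

```python
import ply.lex as lex
import ply.yacc as yacc

# Assumed minimal lexer, defined inline so the example is self-contained.
tokens = ('NUMBER',)
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

# Without this line, the first rule in the file ('empty') would be the start.
start = 'numbers'

def p_empty(p):
    'empty :'
    pass

def p_numbers(p):
    '''numbers : numbers NUMBER
               | empty'''
    if len(p) == 3:
        p[0] = p[1] + [p[2]]   # append the new number to the list so far
    else:
        p[0] = []              # the empty production starts a new list

def p_error(p):
    print("Syntax error in input!")

parser = yacc.yacc(debug=False, write_tables=False)

print(parser.parse("1 2 3", lexer=lexer))
print(parser.parse("", lexer=lexer))
```
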
Setting precedence and associativity can be done using the precedence variable.

precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),            # Unary minus operator
)

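Tokens are ordered from lowest to highest precedence. nonassoc means that those tokens do not associate, so something like a < b < c is illegal whereas a < b is still legal. UMINUS is not a token produced by the lexer; it is a name that a grammar rule can refer to with %prec to override the rule's default precedence.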
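
The following sketch, assuming PLY 3.x, applies a precedence table like the one above to a deliberately ambiguous grammar. The inline lexer, the errors list, and the evaluate helper are assumptions for this example, and only LESSTHAN is included from the comparison operators so that every precedence entry is actually used by a rule.

```python
import operator
import ply.lex as lex
import ply.yacc as yacc

# Assumed minimal lexer, defined inline so the example is self-contained.
tokens = ('NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LESSTHAN')

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LESSTHAN = r'<'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

# Lowest precedence first, highest last.
precedence = (
    ('nonassoc', 'LESSTHAN'),
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),       # fictitious token, used only through %prec
)

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul,
       '/': operator.truediv, '<': operator.lt}

errors = []  # hypothetical collector for offending tokens

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression
                  | expression LESSTHAN expression'''
    p[0] = OPS[p[2]](p[1], p[3])

def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_error(p):
    errors.append(p)

parser = yacc.yacc(debug=False, write_tables=False)

def evaluate(text):
    """Hypothetical helper: parse one expression and return its value."""
    return parser.parse(text, lexer=lexer)

print(evaluate("2 + 3 * 4"))    # TIMES binds tighter than PLUS
print(evaluate("10 - 4 - 3"))   # MINUS is left-associative
print(evaluate("-2 + 5"))       # unary minus binds tighter than PLUS
```

Chaining the non-associative operator, as in 1 < 2 < 3, is reported through p_error instead of being parsed as a nested comparison.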

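parser.out is a debugging file that PLY writes when it builds the parsing tables with debugging enabled, typically the first time the yacc program is executed. It describes the grammar, the parser states, and any conflicts found. When a shift/reduce conflict is not resolved by a precedence rule, the parser shifts by default.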
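
The default of shifting can be observed with a deliberately ambiguous grammar that has no precedence table. This is a minimal sketch assuming PLY 3.x, with an inline lexer invented for the example. PLY is expected to warn about one shift/reduce conflict when building the tables, and debug=False is passed here so that no parser.out file is written.

```python
import ply.lex as lex
import ply.yacc as yacc

# Assumed minimal lexer, defined inline so the example is self-contained.
tokens = ('NUMBER', 'MINUS')

t_MINUS = r'-'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

def p_expression_minus(p):
    'expression : expression MINUS expression'
    # Ambiguous: "a - b - c" could group as (a - b) - c or a - (b - c).
    p[0] = p[1] - p[3]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_error(p):
    print("Syntax error in input!")

parser = yacc.yacc(debug=False, write_tables=False)

# Shifting the second MINUS groups the input as 10 - (4 - 3).
result = parser.parse("10 - 4 - 3", lexer=lexer)
print(result)  # 9
```

Declaring MINUS as left-associative in a precedence table would make the parser reduce first and give the usual result of 3.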
