One of the later assignments in my Foundations of Computing class in college was to implement a recursive descent parser that outputs assembly code (defined within the assignment). I implemented mine in C++. Here’s the program and the source code.
This program accepts any arbitrary grammar as an input file, as well as a file that contains the input string to be parsed. When the program starts, it looks for a file called grammar.txt
in the program’s directory. The grammar should have the following form:
<L> -> <A><L'>
<L'> -> ;<A><L'> | <epsilon><A> -> <I>=<E><#gen>
<E> -> <T><E'>
<E'> -> -<T><#gen><E'> | +<T><#gen><E'> | <epsilon>
<T> -> <F><T'>
<T'> -> /<F><#gen><T'> | *<F><#gen><T'> | <epsilon>
<F> -> <P><F'>
<F'> -> ^<F><#gen> | <epsilon>
<P> -> (<E>) | <I>
<I> -> w|x|y|z
The example grammar above defines a language where any mathematical expression can be formed with the variables w, x, y, z, and the operators +, -, *, /, ^, and =. Multiple expressions can also be separated by a semicolon.
Symbols enclosed in brackets “<…>” are non-terminals. The arrow symbol “->” separates the two sides of the given rewrite rule (production). The “or” symbol “|” is used to separate different choices for a single rewrite rule. Any other symbol is a terminal. The special symbol <epsilon> represents a null or empty terminal symbol (sometimes referred to as lambda). The other special symbol <#gen> is an instruction for the parser to generate code at the point where this symbol appears. Comments can be written in the grammar file by placing a # symbol at the beginning of the line.
The parser supports exactly one non-terminal and no terminals on the LHS (left-hand side) of any rewrite rule. The RHS of any rewrite rule can be arbitrary.
After loading the grammar, the program will ask you to type in the string to be parsed. Alternatively, you can create a file that contains the string, and pass it as a command line parameter to the program. If the data is parsed successfully, the program will generate a file called out.txt
that will contain the assembly code generated by the grammar.