Top-Down Operator Precedence Parsing with R
Posted on February 12, 2016

Parsers are a topic that has been brewing in the back of my head for a while now. Mainly, “how do I write one?” and “I really want to write one!”. There’s something very … zen about the idea of writing a program that understands how to read and interpret another program.
However, I’ve found parser theory to be fairly hard to understand – it’s been like fitting a square peg into a round brain for me. It doesn’t help that the existing literature is quite technical, and (IMHO) there is not much accessible literature that attempts to bridge the gap between formal computer science and software engineering. It also doesn’t help that parsers are widely considered a solved problem in computer science, so many aren’t really interested in talking about them anymore – just throw flex and bison at your language and call it a day.
However, I first gained hope that I might one day be able to write my own parser when I learned about recursive descent parsing. Such parsers are relatively simple to understand, as long as you can write down a specification of your language in Extended Backus-Naur Form (EBNF). EBNF is used to describe a language with a set of rules, which read quite naturally. For example, we could have a rule for R function definitions written as:
<function>: 'function' '(' <arg-list> ')' <expression>
where <arg-list> and <expression> are also rules defining what an argument list and an expression can look like, respectively. In other words, a function in the R language is specified first with a function token, then with an opening parenthesis ( token, then with a (possibly empty) argument list, which is then enclosed by a closing parenthesis ) token, which is finally followed by a (non-empty) function body expression.
Given this, we could write the following pseudo-ish code to parse the function. check() is used to validate a token’s contents, and then advance (behind the scenes) to the next token.
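A pseudo-ish sketch of such a routine might look like the following (parseArgumentList(), parseExpression(), and makeFunctionNode() are hypothetical helpers corresponding to the rules above):

```r
# Pseudo-ish: parse an R function definition following the EBNF rule above.
# check() validates the current token's contents and advances past it.
parseFunction <- function() {
  check("function")              # 'function' keyword
  check("(")                     # opening parenthesis
  args <- parseArgumentList()    # <arg-list>
  check(")")                     # closing parenthesis
  body <- parseExpression()      # <expression>: the function body
  makeFunctionNode(args, body)   # assemble the resulting parse node
}
```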
Don’t think too much about how the above is implemented, just focus on the code’s structure. The routine walks over tokens, checking them as it goes, and calls parse subroutines to get the argument list + expression as necessary. The point I want to make here is that the code you write when implementing a recursive-descent parser mimics the EBNF grammar very closely – this is a very nice property, which makes the parser very easy to write and reason about. I promise that the code you would write, or see, in a true implementation would be very similar to this.
Unfortunately, recursive descent becomes quite clumsy when it comes to handling operator precedence (e.g. expressing that / and * have higher precedence than + and -). In a recursive-descent context, this is handled by having a separate rule for each operator precedence level, e.g. in defining <expression>, we would have to write something like:
<expression> = ["+"|"-"] <term> {("+"|"-") <term>} .
<term> = <factor> {("*"|"/") <factor>} .
<factor> = <symbol> | <number>
This implies having these (in my mind, bogus) names that differentiate the rule handling + and - from the rule handling * and /. This also implies a separate function for each of these rules (e.g. parseTerm(), parseFactor()). You could imagine this being a huge pain – for example, in R, we have 18 different operator groups and precedence classes. Do you want to write 18 different functions for parsing each of these? I certainly don’t! It also implies our parser will be inefficient, as parsing an operator can require deep recursion until the appropriate rule is found.
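To make the shape of this concrete, here is a rough sketch of what the rule-per-precedence-level approach looks like in code, assuming current() peeks at the next token, consume() returns it and advances, and parseNumber() handles literals (all hypothetical helpers here, used purely for illustration):

```r
# One function per precedence level: '+' and '-' live in parseExpression(),
# '*' and '/' live in parseTerm(), and atoms live in parseFactor().
parseExpression <- function() {
  node <- parseTerm()
  while (current() %in% c("+", "-")) {
    op <- consume()
    node <- call(op, node, parseTerm())
  }
  node
}

parseTerm <- function() {
  node <- parseFactor()
  while (current() %in% c("*", "/")) {
    op <- consume()
    node <- call(op, node, parseFactor())
  }
  node
}

parseFactor <- function() {
  parseNumber()   # or a symbol, per the <factor> rule
}
```

Every new precedence level means another function in this chain, and every atom is reached only after recursing through all of them.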
With that, I had somewhat given up hope on recursive-descent parsing. It’s obviously doable to implement an entirely recursive-descent based parser for a language, but it’s boring, unwieldy, and slow.
And then I discovered top-down operator precedence (TDOP) parsers (also called Pratt parsers, named for Vaughan Pratt, who introduced this in his 1973 paper Top-Down Operator Precedence.) This is a parsing technique that handles operator precedence, both unary and binary, in an extremely elegant, clean, and performant fashion. I stumbled upon it when I learned about JSLint (a program that identifies ‘code smells’ in JavaScript programs) and the article in which Douglas Crockford describes how he used top-down operator precedence parsing to implement it.
However, Crockford’s article rubs me the wrong way in a few places, with the use of technical terms like led (‘left denotation’) and nud (‘null denotation’) which are opaque enough that my brain hiccups whenever I see them. That said, even if my blog post is (hopefully) a more accessible introduction, I highly recommend that you read Crockford’s article next. Alternatively (hat tip to @tjmahr), you can watch Douglas Crockford’s recent video Syntaxation, where he discusses TDOP parsing (we share a similar love of the core of TDOP).
I think the core concepts behind TDOP parsers can be expressed much more plainly. So, here goes. In a language, there are two kinds of symbols:
- Symbols that ‘start’ an expression (S), and
- Symbols that ‘continue’ an expression (C).
For example, given the program below we could denote the symbols:
```
- x + y * - 4
S S C S C S S
```
Note that both unary operators and variables can be encountered at the start of an expression. The binary operators + and * serve to continue, or join, the expression (such that the program above is indeed a single expression). Whether an operator is considered unary or binary depends on whether it’s encountered at the ‘start’ or ‘middle’ of an expression.
We’re going to have two functions for handling parses in the ‘start’ and ‘continue’ states – I’ll name them now, but we’ll develop implementations later.
- parseExpressionStart(): parse routine for tokens that ‘start’ expressions,
- parseExpressionContinuation(): parse routine for tokens that ‘continue’ expressions.
Armed with this basis, let’s consider how we could write a parser for a super simple calculator language. In our language, we will have integers [0-9]+, and the binary operators + and *, where * has a higher precedence than +. It’s a tiny language, but has just enough to force us to properly handle operator precedence, and once that’s a solved problem, everything else will fall out naturally.
Unfortunately, we can’t implement a parser without a tokenizer, so let’s get this out of the way first. Briefly, a tokenizer splits a program (as a string) into the base syntactic elements (tokens) that make up the language. We’ll have three functions that handle tokenization:
- tokenize(): takes a program as a string, and returns a tokenized representation,
- current(): returns the current active token, and
- consume(): returns the current token, and also advances the token index.
Let’s implement a simple, crude tokenizer:
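Here’s a minimal sketch of one – it assumes tokens are separated by whitespace (so we write 1 + 2 rather than 1+2), which is plenty for our purposes:

```r
# Shared tokenizer state: the token vector and the current index.
TOKENIZER <- new.env()

# Split a program into whitespace-separated tokens and reset the index.
tokenize <- function(program) {
  TOKENIZER$tokens <- strsplit(trimws(program), "\\s+")[[1]]
  TOKENIZER$index  <- 1L
  TOKENIZER$tokens
}

# Return the current token, or "" once we've run out of input.
current <- function() {
  if (TOKENIZER$index > length(TOKENIZER$tokens))
    return("")
  TOKENIZER$tokens[[TOKENIZER$index]]
}

# Return the current token, and advance to the next one.
consume <- function() {
  token <- current()
  TOKENIZER$index <- TOKENIZER$index + 1L
  token
}
```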
Clunky, but this will be enough to power our parser. So let’s get back to the fun stuff.
Now, I’m going to show you the top-level parse routine. Hold on to something, because this is going to be rad. With operator-precedence parsing, our parse routine can be expressed in an incredibly simple, elegant form. Honestly, it’s one of the most elegant ways of solving a problem that I’ve ever seen – I don’t often call algorithms beautiful, but this one is. We can implement our top-level parse routine in just 4 lines of code:
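Here’s a sketch of what it looks like (parseExpressionStart(), parseExpressionContinuation(), and binaryPrecedence() are the helpers we’ll define in a moment):

```r
# The TDOP engine: parse a 'start' token, then keep joining the expression
# for as long as the next operator binds more tightly than the precedence
# we were called with.
parseExpression <- function(precedence = 0) {
  node <- parseExpressionStart()
  while (precedence < binaryPrecedence(current()))
    node <- parseExpressionContinuation(node)
  node
}
```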
This is the engine that gives us a parser that understands binary precedence. Seriously. Yes, a single function with a 4-line function body and we have something that will be able to understand binary precedence. We’ve escaped from the recursion hell that pure recursive-descent parsers have when attempting to handle operator precedence. Even more, this basis will handle new extensions to our language with no problem at all, as you’ll see later. This is the holy grail upon which we can build our parser, and everything will just come together cleanly.
Let’s ignore how it works for now, and just assume that it will work. Now, let’s talk about our supporting cast. First, parseExpressionStart(). For our simple calculator, the only syntactic element in the language that can occur at the start of an expression is an integer (we don’t have unary operators yet), so we just accept that:
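A sketch, given the tokenizer above:

```r
# The only 'start' tokens in our calculator are integers: consume one and
# convert it into a number.
parseExpressionStart <- function() {
  as.numeric(consume())
}
```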
Next, our parse routine for things that continue an expression – for our simple calculator, this is either a + or a *. The key part is that this routine will generate a call() object, with the current binary operator as the ‘head’, the node passed in as the first child, and the result of a recursive call to parseExpression() as the second child. Note that we pass down the binary precedence of that operator as well:
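A sketch of that routine:

```r
# 'Continue' the expression: consume the binary operator, then build a call
# with the operator as the head, the left-hand node as the first child, and
# a recursively parsed right-hand side (parsed at that operator's
# precedence) as the second child.
parseExpressionContinuation <- function(node) {
  op <- consume()
  call(op, node, parseExpression(binaryPrecedence(op)))
}
```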
Finally, our binaryPrecedence() function will report the precedence of a binary operator. We include the default 0 case to handle when we reach the end of our input stream – this signals the parse engine to stop trying to continue, or join, expressions.
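A sketch – the particular numbers don’t matter, so long as * outranks + and everything else (including the end of input) maps to 0:

```r
# Report the precedence of a binary operator. Unknown tokens (including the
# "" we return at the end of input) fall through to the default of 0.
binaryPrecedence <- function(token) {
  switch(token,
    "+" = 10,
    "*" = 20,
    0
  )
}
```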
Now, let’s put this all together with a working example. The following chunk is a complete implementation of the parser for our simple calculator program, and the implementation isn’t even 50 lines of code!
God, I love it! It may be a simple language, but with less than 50 lines of code we have a fully functioning parser.
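As a quick sanity check, here’s a sketch of exercising the pieces above end to end:

```r
tokenize("1 + 2 * 3 + 4")
ast <- parseExpression()
ast
#> 1 + 2 * 3 + 4

# '*' binds 2 and 3 before either '+' gets a chance, so the structure should
# match what R's own parser produces for the same expression:
identical(ast, quote(1 + 2 * 3 + 4))
#> [1] TRUE
```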
Now, let’s think a bit about how it all works, and trace out what happens when we make a parseExpression() call. Let’s look at our engine, and trace what happens behind the scenes using R’s trace() function.
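Here’s one way to set that up (a sketch; print = FALSE just suppresses trace()’s own default messages, and DEPTH is a counter used purely for indentation):

```r
# Keep a depth counter so the trace output indents along with the recursion.
DEPTH <- 0

trace(parseExpression, print = FALSE,
      tracer = quote({
        message(strrep("  ", DEPTH), "parseExpression(", precedence, ")")
        DEPTH <<- DEPTH + 1
      }),
      exit = quote(DEPTH <<- DEPTH - 1))

trace(parseExpressionContinuation, print = FALSE,
      tracer = quote(
        message(strrep("  ", DEPTH), "parseExpressionContinuation()")
      ))

tokenize("1 + 2 * 3 + 4")
parseExpression()

untrace(parseExpression)
untrace(parseExpressionContinuation)
```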
It’s pretty nifty that the tracer output basically mimics the structure of the actual AST that we receive at the end. The main thing to notice is how the recursion begins and ends – starting at the first parseExpressionContinuation(), note that the * call becomes a child of that node. Because * has a higher precedence, it binds more tightly, and so it indeed binds 2 rather than letting + take it. Similarly, the * gets ownership of 3 over the following + operator. It’s still tough for me to wrap my head around it, but this helps.
Let’s now start extending our language. The foundations are solid enough that we can easily extend our parser. Let’s first start by making our parser more generic – we’ll abstract out the concept of operators and precedence:
- We’ll create a PRECEDENCE object that maps the set of available unary and binary operators, and associates them with a precedence,
- We’ll have separate unaryPrecedence() and binaryPrecedence() functions for retrieving these precedences, and
- We’ll augment our parseExpressionStart() function to handle unary operators (all sketched below).
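Here’s a sketch of those pieces (the particular precedence values are just illustrative):

```r
# A table of operators: in this sketch, unary operators bind more tightly
# than any of our binary operators.
PRECEDENCE <- list(
  unary  = c("+" = 30, "-" = 30),
  binary = c("+" = 10, "-" = 10, "*" = 20, "/" = 20)
)

unaryPrecedence <- function(token) {
  if (token %in% names(PRECEDENCE$unary)) PRECEDENCE$unary[[token]] else 0
}

binaryPrecedence <- function(token) {
  if (token %in% names(PRECEDENCE$binary)) PRECEDENCE$binary[[token]] else 0
}

# 'Start' tokens are now either unary operators or numbers.
parseExpressionStart <- function() {
  token <- consume()
  if (unaryPrecedence(token) > 0)
    return(call(token, parseExpression(unaryPrecedence(token))))
  as.numeric(token)
}
```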
We’ve extended our language to handle unary operators, as well as the binary operators - and /. Something that’s even cooler is, since our parser is now entirely powered by the precedence table, we could even modify that table at run time to introduce new syntax into the language. Look, |> is an operator now!
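For example (the precedence value here is arbitrary – anything lower than + will do for a pipe-like operator):

```r
# Teach the precedence table a brand new binary operator at runtime.
PRECEDENCE$binary[["|>"]] <- 5

tokenize("1 + 2 |> 3 * 4")
parseExpression()
# a call with `|>` at its head, joining (1 + 2) and (3 * 4)
```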
That’s goddamn rad.
What about control flow and custom forms, you ask? What if I want to turn my calculator into a real programming language? These fit easily as well, and this is where our recursive-descent style routines come in. The key thing to realize is that these control flow constructs are just special things that can ‘start’ an expression. So, they need code that lives in parseExpressionStart(). Here’s what that might look like for an R while loop:
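A sketch (check() is the token-validation helper described back in the recursive-descent example: it validates the current token, then advances past it):

```r
# Validate that the next token is what we expect, then move past it.
check <- function(expected) {
  token <- consume()
  if (!identical(token, expected))
    stop("expected '", expected, "' but found '", token, "'")
  invisible(token)
}

parseExpressionStart <- function() {

  # 'while' starts an expression: parse the condition and the body, then
  # glue them together into a while call.
  if (identical(current(), "while")) {
    consume()
    check("(")
    condition <- parseExpression()
    check(")")
    body <- parseExpression()
    return(call("while", condition, body))
  }

  # otherwise, fall back to unary operators and numbers as before
  token <- consume()
  if (unaryPrecedence(token) > 0)
    return(call(token, parseExpression(unaryPrecedence(token))))

  as.numeric(token)
}
```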
You know what? Let’s go all out and make it possible to add custom constructs at runtime!
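One possible sketch: keep a registry of ‘start’ parsers that can be extended at runtime (STARTERS and registerStarter() are names invented for this post):

```r
# A registry mapping a token to the routine that knows how to parse the
# construct it starts.
STARTERS <- list()

registerStarter <- function(token, parser) {
  STARTERS[[token]] <<- parser
}

# The 'start' routine now consults the registry before falling back to
# unary operators and numbers.
parseExpressionStart <- function() {

  token <- current()
  if (token %in% names(STARTERS))
    return(STARTERS[[token]]())

  token <- consume()
  if (unaryPrecedence(token) > 0)
    return(call(token, parseExpression(unaryPrecedence(token))))

  as.numeric(token)
}

# Register 'while' as a construct at runtime, without touching the parser.
registerStarter("while", function() {
  consume()                        # 'while'
  check("(")
  condition <- parseExpression()
  check(")")
  call("while", condition, parseExpression())
})
```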
We now have a mechanism for building a language at runtime. Mind-blowing! You could imagine taking this to the logical extreme and implementing an R parser – in R – with your own runtime extensions.
Hopefully after walking through this with me, you can see just how exciting and awesome Pratt parsers truly are. Simple, elegant, extensible, and performant… these are, in my mind, the unicorns of the parser world, and it’s no surprise that gcc and clang use their own custom-built parsers built with this same model.
What have I left out? There are a few things:

1. Left-associativity vs. right-associativity of operators,
2. Handling and reporting errors (give up? attempt to fix up and continue to parse?),
3. What signals the ‘end’ of an expression?
4. Function calls.
I’ll direct you to the ‘Further Reading’ for 1), and your own devices for 2), 3) and 4).
One thing you might notice about 3) is that our parser doesn’t actually care – it could easily just parse `1 + 2 3 - 4 5 + 6` as 3 separate expressions, but to be nice to the programmer, we typically force expressions to be separated with `;` or a newline or some other delimiter.
Finally, a hint for 4), though: you can either implement your own custom parse routine for functions, or you can treat ( as a binary operator, with the restriction that it must find a matching closing ) when you finish parsing.
But honestly, this is the core of it – the rest is your own extensions. Parse on!
Further Reading
This blog post is highly inspired by the following posts, but of course presents its implementation in R.
- Syntaxation: Douglas Crockford’s talk re: TDOP.
- Top Down Operator Precedence: Douglas Crockford’s use of TDOP in JSLint.
- Top-down Operator Precedence Parsing: TDOP in Python.
- Simple Top-Down Parsing in Python: More TDOP in Python.
- TDOP / Pratt parser in pictures: A nice animated view of how the parse tree is built when using a Pratt parser.
- Let’s Build a Compiler: An excellent book on how one might construct a compiler for Pascal. It uses a (pure) recursive-descent parser, but also discusses emitting code and more.