- Parsing programming languages
- A simple language
- Lexical analysis
- Lexical analysis, automated
- Parser == grammar police
- A simple parser
- Syntax trees and recursive descent
- Parsing expressions
- Testing our parser
- Homework
- Playground
If you're reading this, you probably know how to program a computer. You write some text, and use a special thing called a "compiler" (sometimes also "interpreter") to turn that text into instructions for your computer.
I want to attempt to explain a small part of the process of transforming text into those instructions, usually referred to as "parsing", by implementing a parser in Rust in an opinionated way.
Parsing programming languages
As a programmer, you intuitively understand how to recognize different parts of a programming language. It's similar to reading any other form of text. You read it one line at a time, or by scanning for patterns. Each chunk is given some kind of meaning based on its contents.
But a computer doesn't know how to do that without your help. It's your job to break up the text into smaller pieces, the individual syntactic elements of a program. You also have to assign each piece of syntax some sort of meaning. The program which does this is usually called a parser, and is often only one stage in a full compiler.
The parser reads source code, and transforms it into a syntax tree[1]. Depending on what exactly you're trying to do, this syntax tree can either be concrete or abstract.
A concrete syntax tree (CST) stores everything necessary to preserve the exact appearance of a program: parentheses, curly brackets, keywords, and so on. An abstract syntax tree (AST) only stores whatever is necessary to preserve the program's meaning, discarding everything else. We'll be using an AST[2].
A simple language
Our testbed for parsing will be a simple programming language called simp[3].
Here's a simple program:
fn factorial(n) {
if n <= 1 { n }
else { n * factorial(n - 1) }
}
let fact_5 = factorial(5);
assert(fact_5 == 120, "5! must be 120");
It's so simple that it's just barely useful. We have:
- Functions, and function calls (including recursion!)
- Variables
- Various operators
- Integers, strings
And not much more. The syntax is heavily inspired by Rust; it's vaguely C-like with curly brackets and semicolons, is "expression-based"[4], declares functions with fn, and so on.
We have some fancy structured text. What now?
Lexical analysis
To make the process of parsing easier, source text is typically first split up into a list of tokens, a step referred to as "tokenization" or "lexing" (short for lexical analysis). It's not strictly necessary, but it is useful; our parser becomes a lot simpler by matching on tokens instead of string slices or individual characters.
Each token will need to store what kind of token it is, and a span[5]. We'll use the span to retrieve the token's lexeme -- its slice of the source code.
struct Token {
kind: TokenKind,
span: Span,
}
For brevity, some code snippets are incomplete; the full source code is available on GitHub.
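The Span type is only described in a footnote and never shown in the article. Here's a minimal sketch of what it might look like, assuming nothing beyond what that footnote says (a Copy-able Range<u32> lookalike that can be built from a Range<usize> and used to index into strings); the real type may differ in details:
// A sketch of Span: a Copy-able stand-in for Range<u32>.
use std::ops::{Index, Range};

#[derive(Clone, Copy, Debug)]
struct Span {
    start: u32,
    end: u32,
}

impl Span {
    fn empty() -> Self {
        Span { start: 0, end: 0 }
    }
}

impl From<Range<usize>> for Span {
    fn from(range: Range<usize>) -> Self {
        Span {
            start: range.start as u32,
            end: range.end as u32,
        }
    }
}

impl Index<Span> for str {
    type Output = str;

    fn index(&self, span: Span) -> &str {
        // Spans are byte offsets into the source string.
        &self[span.start as usize..span.end as usize]
    }
}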
The core of our lexer is a loop, reading one byte[6] at a time from our source text. It'll also need to keep track of where in the source code it is.
fn lex(code: &str) -> Vec<Token> {
let bytes = code.as_bytes();
let mut pos = 0;
while pos < bytes.len() {
let start = pos; // token's starting position
// ...
}
}
Next we need to determine if we're looking at a character which is valid in this position. Rust's pattern matching is convenient for this task.
To match one-byte tokens, such as + and -, we only need to match the first byte:
match bytes[pos] {
b'+' => { /* ... */ }
b'-' => { /* ... */ }
// ...
}
For two-byte tokens, such as ==, we'll also need to peek at the next byte:
match bytes[pos] {
// ...
b'=' => if bytes.get(pos+1) == Some(&b'=') {
// matched `==`
pos += 2;
} else {
// matched only `=`
pos += 1;
}
}
Our final token "category" is an arbitrary-length sequence. Our lexer is greedy, meaning it continues to "append" characters to the current token for as long as it's a valid sequence. We'll use an inner loop for that:
match bytes[pos] {
// ...
// identifier
b'_' | b'a'..=b'z' | b'A'..=b'Z' => {
pos += 1;
while let Some(byte) = bytes.get(pos)
&& let b'_' | b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' = *byte
{
pos += 1;
}
// ...
}
}
simp also has a few keywords. After we lex an identifier, we'll have to check if it's a keyword, and output the token kind for that keyword instead.
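The keyword check itself could look roughly like this (a sketch; it uses the token kinds we'll define in the next section):
// Sketch: decide whether the identifier we just consumed is a keyword,
// by looking at its lexeme.
let kind = match &code[start..pos] {
    "fn" => KW_FN,
    "if" => KW_IF,
    "else" => KW_ELSE,
    "let" => KW_LET,
    _ => LIT_IDENT,
};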
Once we've "consumed" some valid sequence of characters, we can construct a token:
let end = pos;
Token {
kind: /*...*/,
span: Span::from(start..end),
}
Lexical analysis, automated
While I'd be happy to write a lexer by hand, it tends to be a very error-prone, tedious task. Our needs are very basic, so to make our job easier, we'll use a library called Logos. It's easy to use, and generates very fast lexers, the kind which would be very impractical to write by hand.
I intentionally omitted the definition of our TokenKind enum until now. Here it is:
#[derive(Logos)]
#[logos(skip r"[ \t\n\r]+")]
enum TokenKind {
#[token("fn")] KW_FN,
#[token("if")] KW_IF,
#[token("else")] KW_ELSE,
#[token("let")] KW_LET,
#[token(";")] TOK_SEMI,
#[token("(")] TOK_LPAREN,
#[token(")")] TOK_RPAREN,
#[token("{")] TOK_LBRACE,
#[token("}")] TOK_RBRACE,
#[token(",")] TOK_COMMA,
#[token("-")] OP_MINUS,
#[token("+")] OP_PLUS,
#[token("*")] OP_STAR,
#[token("/")] OP_SLASH,
#[token("=")] OP_EQ,
#[token("||")] OP_OR,
#[token("&&")] OP_AND,
#[token("==")] OP_EQEQ,
#[token("!=")] OP_NEQ,
#[token("<")] OP_LT,
#[token("<=")] OP_LE,
#[token(">")] OP_GT,
#[token(">=")] OP_GE,
#[token("!")] OP_BANG,
#[regex(r"[0-9]+")]
LIT_INT,
#[regex(r#""([^"\\]|\\.)*""#)]
LIT_STR,
#[regex(r"[a-zA-Z_][a-zA-Z_0-9]*")]
LIT_IDENT,
TOK_ERROR,
TOK_EOF,
}
Logos generates a lexer for us using the annotations on this enum. We have tokens for keywords, punctuation characters, operators, and "literals" identified by a regular expression.
We also have two special tokens:
- TOK_ERROR, produced when the lexer can't match a given part of the source code against any token, and
- TOK_EOF, produced when the lexer reaches the end of the file.
If you're curious about what it expands to, here it is in a gist.
Aside: Storing data in TokenKind
Our TokenKind variants don't carry any data. It is possible to get Logos to parse integers for us, and store the resulting value inside the TokenKind enum. We're not going to use that feature. What we want is for the lexer to only recognize tokens, and to continue even when encountering an error. We'll be much better positioned to report errors in the parser, so we'll process the token's lexeme there, turning a LIT_INT into an i64, for example.
It's also a bit more complicated to match on tokens when they may carry values. In my experience, doing it this way leads to leaner, faster code.
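For reference, the data-carrying alternative would look roughly like this with Logos (a sketch of the feature we're deliberately not using; TokenKindAlt is a made-up name for illustration):
use logos::Logos;

#[derive(Logos)]
enum TokenKindAlt {
    // Logos runs the callback on the matched slice and stores the
    // resulting i64 inside the variant.
    #[regex(r"[0-9]+", |lex| lex.slice().parse::<i64>().ok())]
    LitInt(i64),
    // ...
}
Every match on such a token then has to destructure the payload, which is the extra noise mentioned above.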
The enum variant naming convention may seem strange at first, but this is something I picked up from reading the source code of rust-analyzer. The idea is to name each token kind something globally unique, and then use a glob import when using them. Using all-uppercase plus prefixes like KW_ and TOK_ for "groups" of token kinds works well for that.
Unlike rust-analyzer, we use token prefixes instead of suffixes. That way the auto-completion list is much shorter: if you want to match on some keyword, type KW_, and you'll only see the subset of token kinds which are keywords.
Now we can "write" our lexer:
fn lex(src: &str) -> Vec<Token> {
TokenKind::lexer(src)
.spanned()
.map(|item| match item {
(Ok(kind), span) => Token {
kind,
span: span.into(),
},
(Err(()), span) => Token {
kind: TOK_ERROR,
span: span.into(),
},
})
.chain([Token {
kind: TOK_EOF,
span: Span::empty(),
}])
.collect()
}
Logos is doing a lot of heavy lifting here. An equivalent lexer would be a few hundred lines of code, and it wouldn't be nearly as fast. At this point, the only reason this function exists is to preprocess what Logos outputs into a slightly nicer form.
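As a quick sanity check, here's the kind of output we get for a tiny snippet (a sketch of a unit test, assuming TokenKind also derives PartialEq and Debug):
#[test]
fn lex_let_statement() {
    let tokens = lex("let x = 1;");
    let kinds: Vec<TokenKind> = tokens.into_iter().map(|t| t.kind).collect();
    // whitespace is skipped, and a TOK_EOF is appended at the end
    assert_eq!(kinds, [KW_LET, LIT_IDENT, OP_EQ, LIT_INT, TOK_SEMI, TOK_EOF]);
}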
Parser == grammar police
Before we talk about parsing, we still need to talk about what we'll be parsing. Human languages have grammar, and so do programming languages. Just like a school teacher, our parser will only accept what is grammatically valid[7]. For anything else, it will output syntax errors.
We know what simp syntax looks like, but what does its grammar look like? To describe the grammar of human language, we often use more human language to do it. But that won't work for computers.
To describe a programming language grammar, we'll use a notation called Extended Backus-Naur form (EBNF). The E in EBNF effectively means we can do whatever we want, because there is no single[8] standard!
Here's what simp's formal grammar looks like:
Program = { Stmt } [ Expr ] ;
Stmt = StmtFn | StmtLet | StmtExpr ;
StmtFn = "fn" Identifier "(" [ ParamList ] ")" Block ;
ParamList = Identifier { "," Identifier } [ "," ] ;
StmtLet = "let" Identifier "=" Expr ";" ;
StmtExpr = Expr ";" ;
Expr = ExprIf | ExprOr ;
ExprIf = "if" Expr Block "else" Block ;
ExprOr = ExprAnd { "||" ExprAnd } ;
ExprAnd = ExprEq { "&&" ExprEq } ;
ExprEq = ExprOrd { ( "==" | "!=" ) ExprOrd } ;
ExprOrd = ExprAdd { ( "<" | "<=" | ">" | ">=" ) ExprAdd } ;
ExprAdd = ExprMul { ( "+" | "-" ) ExprMul } ;
ExprMul = ExprUnary { ( "*" | "/" | "%" ) ExprUnary } ;
ExprUnary = [ "-" | "!" ] ExprPostfix ;
ExprPostfix = ExprPrimary { ExprCall } ;
ExprCall = "(" [ ArgList ] ")" ;
ArgList = Expr { "," Expr } [ "," ] ;
ExprPrimary = INTEGER
| STRING
| IDENTIFIER
| Block
| "(" Expr ")"
;
Block = "{" { Stmt } [ Expr ] "}" ;
To break it down:
- name = ... ; is a nonterminal.
- Anything in "quotes" or UPPERCASE is a terminal.
- Anything in [ brackets ] is optional.
- Anything in { braces } is repeated, but also optional.
- left | right means "left or right", but not both. It can be chained.
It's useful to think of nonterminals as functions which can be called to parse a specific sequence of other symbols. Terminals are tokens, and are indivisible. For example, the rule StmtLet = "let" Identifier "=" Expr ";" ; reads as: match the let keyword, then an identifier, an = sign, an expression, and a closing semicolon, in that order.
We have two main categories of syntax: statements and expressions.
"Where an expression's main job is to produce a value, a statement's job is to produce an effect."
— Bob Nystrom, in his book Crafting Interpreters[9]
Some parts of the grammar are omitted for brevity, such as identifiers[10] and strings[11]. They tend to expand into quite a few rules, which doesn't really help us here, so we treat them as terminals.
We also have a special case in the parser for a trailing expression in a block of code, which is the implicit return value[12] of that block. That case is specified as { Stmt } [ Expr ] in the grammar, but it isn't quite that simple in reality.
And finally, we do support trailing commas[13], but the grammar doesn't reflect how they are handled.
A simple parser
The purpose of a parser is:
Given a valid sequence of tokens, produce a syntax tree.
There are many ways to do this. It's a whole field of study, one we're not going to delve very deeply into right now.
The technique I've chosen for this parser is called recursive descent. It's simple, very common, and very effective. Another common technique is called Pratt parsing. Sometimes multiple different kinds of parsing techniques are combined together into a single super-parser.
Our parser will read the source code, one token at a time, and traverse the grammar. For each rule we traverse, we'll either successfully match it, or produce an error.
In most of our parser, all we'll have to do is match on the current token, and advance to the next position if we find the right one.
To help us with that, we'll need a cursor to keep track of which token the parser is on:
struct Cursor<'src> {
code: &'src str,
tokens: Vec<Token>,
position: usize,
}
And a few methods on the cursor to work with our token list:
impl<'src> Cursor<'src> {
/// Advance the cursor to the next token.
fn advance(&mut self) {
if self.position + 1 >= self.tokens.len() {
return;
}
self.position += 1;
}
/// Returns the token under the cursor.
fn current(&self) -> Token {
self.tokens[self.position]
}
/// Returns the token before the cursor.
fn previous(&self) -> Token {
self.tokens[self.position - 1]
}
/// Returns the current token kind,
/// shorthand for `c.current().kind`.
fn kind(&self) -> TokenKind {
self.current().kind
}
/// Returns `true` if the current token matches `kind`.
fn at(&self, kind: TokenKind) -> bool {
self.current().kind == kind
}
/// Returns `true` if the previous token matched `kind`.
fn was(&self, kind: TokenKind) -> bool {
self.position > 0 && self.previous().kind == kind
}
/// Returns `true` and advances
/// if the current token matches `kind`.
fn eat(&mut self, kind: TokenKind) -> bool {
if self.at(kind) {
self.advance();
true
} else {
false
}
}
/// Returns the current token if it matches `kind`,
/// otherwise returns an error.
fn must(&mut self, kind: TokenKind) -> Result<Token> {
let current = self.current();
if self.eat(kind) {
Ok(current)
} else {
error(
format!(
"expected {kind:?}, found {:?}",
current.kind,
),
current.span,
)
.into()
}
}
}
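One helper is used later but not shown above: c.lexeme, which fetches a token's slice of the source code. A minimal sketch, relying on Span being usable as a string index (see the footnote about our Span type):
impl<'src> Cursor<'src> {
    /// Returns the token's lexeme -- its slice of the source code.
    fn lexeme(&self, token: Token) -> &'src str {
        &self.code[token.span]
    }
}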
Syntax trees and recursive descent
The entrypoint for our parser will be a parse function. It turns a string into a program, or a syntax error.
fn parse(code: &str) -> Result<Program> {
let cursor = &mut Cursor {
code,
tokens: lex(code),
position: 0,
};
parse_program(cursor)
}
The process of hand-writing a recursive descent parser involves closely following the grammar, implementing it one nonterminal at a time. We also need to define each kind of syntax node as we go along.
Our entrypoint calls into parse_program. Starting with its syntax tree node:
struct Program {
body: Vec<Stmt>,
tail: Option<Expr>,
}
A program is a list of statements. We'll continue parsing statements until we reach the end of the file.
// Program = { Stmt } [ Expr ] ;
fn parse_program(c: &mut Cursor) -> Result<Program> {
let mut body = Vec::new();
while !c.at(TOK_EOF) {
body.push(parse_stmt(c)?);
}
// ...
}
Aside: Trailing expressions
We're trying to mimic Rust, which can be a bit tricky. A block of code can have a trailing expression. That happens when it ends with an expression, and that expression isn't terminated by a semicolon:
{ { 1 } { 2 } } // valid syntax, implicit return value is `2`
Developing parsers (and compilers in general) often involves thinking about contrived examples like these. Parsers tend to have a lot of emergent behavior. It's your job to rein it in!
You can add parser complexity to allow users of your language to write fewer characters. You can also keep things more verbose and explicit, and make your own life as a compiler developer easier. A verbose feature which solves a problem is still probably better than not solving the problem at all. Sometimes there's a clever way to achieve both fewer characters and a simpler compiler.
In this case, we're adding parser complexity in exchange for the ability to write code like this:
{
let a = 1;
{
// more tightly scoped thing
let secret = 42;
foo(a + secret)
} // no semicolon, despite being an expression
a // correctly parsed as the trailing expression for this block
}
Because we don't want to require block-like expressions be terminated by a semicolon, we need to do a few things:
- Any statement which requires a semicolon, such as let, must consume it.
- Any non-trailing expression in statement position must require a semicolon.
We can determine if an expression in statement position is trailing by checking what comes after it:
- EOF - it's the end of the file, so it's trailing.
- } - it's the end of a block, also trailing.
- Anything else must be another statement or expression, so we require a semicolon to separate them.
This is a consequence of how simp's grammar is laid out. It shows why writing a (mostly) formal grammar is useful! You can reason about the entirety of your syntax, and resolve ambiguities.
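To make that concrete, here's a minimal sketch of how parse_stmt_expr could enforce those rules. This is an illustration of the idea, not necessarily the exact code from the repository:
// Sketch: an expression in statement position must be followed by `;`,
// unless it is block-like (its last token was `}`) or the enclosing
// block/file ends right after it (making it the trailing expression).
fn parse_stmt_expr(c: &mut Cursor) -> Result<Stmt> {
    let expr = parse_expr(c)?;
    let block_like = c.was(TOK_RBRACE);
    let at_end = c.at(TOK_RBRACE) || c.at(TOK_EOF);
    if !block_like && !at_end {
        c.must(TOK_SEMI)?;
    } else {
        // an explicit `;` is still allowed, e.g. `{ 1 };`
        c.eat(TOK_SEMI);
    }
    Ok(Stmt::Expr(Box::new(expr)))
}
The enclosing parse_program or parse_block can then check c.was(TOK_SEMI) just before consuming the closing brace (or EOF): if the last statement it parsed was an expression that wasn't terminated by a semicolon, that expression is popped out of body and becomes tail.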
A statement is either a function, a variable, or an expression. To determine what we're looking at, we need to match on the current token:
// Stmt = StmtFn | StmtLet | StmtExpr ;
enum Stmt {
Fn(Box<StmtFn>),
Let(Box<StmtLet>),
Expr(Box<Expr>),
}
fn parse_stmt(c: &mut Cursor) -> Result<Stmt> {
match c.kind() {
KW_FN => parse_stmt_fn(c),
KW_LET => parse_stmt_let(c),
_ => parse_stmt_expr(c),
}
}
Our AST is recursive; an Expr can contain other Exprs. Without boxing, the types would be infinitely recursive. rustc would not be happy about that:
error[E0072]: recursive types `ast::Stmt` and `ast::StmtFn` have infinite size
--> src/ast.rs:11:1
|
11 | pub enum Stmt {
| ^^^^^^^^^^^^^
12 | Fn(StmtFn),
| ------ recursive without indirection
Aside: Boxing enum variants
Instead of boxing enum variants, we could box Expr and Stmt fields in each syntax node, which would also prevent the type from being infinitely recursive.
Enums in Rust are always as large as their largest variant, and that's the problem with this alternate approach: A sizeable chunk of the memory allocated for our syntax nodes would be wasted.
By boxing variants like we do above, they're all 8 bytes, and each Box allocation is exactly sized to fit the fields of a given syntax node, wasting much less memory overall.
A function has a name, a list of parameters, and a body.
struct StmtFn {
name: String,
params: Vec<String>,
body: Block,
}
fn parse_stmt_fn(c: &mut Cursor) -> Result<Stmt> {
assert!(c.eat(KW_FN));
let name = parse_ident(c)?;
let params = parse_param_list(c)?;
let body = parse_block(c)?;
Ok(Stmt::Fn(Box::new(StmtFn { name, params, body })))
}
parse_stmt_fn can expect the cursor to be on top of a fn keyword, so we assert on it. We're already matching on the keyword in parse_stmt, so a failure to eat KW_FN would be our fault, not the user's. Rule of thumb:
Tokens should be consumed by the node which they belong to.
Identifiers are terminal symbols. To parse one, we must retrieve the lexeme of a LIT_IDENT token:
fn parse_ident(c: &mut Cursor) -> Result<String> {
let token = c.must(LIT_IDENT)?;
Ok(c.lexeme(token).to_owned())
}
Parameters are a parenthesized, comma-separated list of identifiers:
fn parse_param_list(c: &mut Cursor) -> Result<Vec<String>> {
let mut list = Vec::new();
c.must(TOK_LPAREN)?;
// stop early if we have an empty list: `()`
if !c.at(TOK_RPAREN) {
loop {
list.push(parse_ident(c)?);
// stop if there's no comma,
// or if the comma is trailing.
if !c.eat(TOK_COMMA) || c.at(TOK_RPAREN) {
break;
}
}
}
c.must(TOK_RPAREN)?;
Ok(list)
}
It's a little more involved, because we also want to allow trailing commas. We'll actually want to re-use that code later for argument lists, so let's break it out right now:
fn parse_paren_list<F, T>(
c: &mut Cursor,
mut elem: F,
) -> Result<Vec<T>>
where
F: FnMut(&mut Cursor) -> Result<T>,
{
let mut list = Vec::new();
c.must(TOK_LPAREN)?;
if !c.at(TOK_RPAREN) {
loop {
list.push(elem(c)?);
if !c.eat(TOK_COMMA) || c.at(TOK_RPAREN) {
break;
}
}
}
c.must(TOK_RPAREN)?;
Ok(list)
}
And use that instead in parse_param_list:
fn parse_param_list(c: &mut Cursor) -> Result<Vec<String>> {
parse_paren_list(c, parse_ident)
}
A function's body is a block. It has an identical definition to Program:
struct Block {
body: Vec<Stmt>,
tail: Option<Expr>, // see aside about trailing expressions
}
And its implementation is largely the same, except that now we also require braces to wrap the contents:
fn parse_block(c: &mut Cursor) -> Result<Block> {
c.must(TOK_LBRACE)?;
let mut body = Vec::new();
while !c.at(TOK_RBRACE) {
body.push(parse_stmt(c)?);
}
c.must(TOK_RBRACE)?;
// ...
}
Statements tend to be the easiest kind of thing to parse, because they usually begin with a keyword or unique symbol. That gives us a clear starting point, from which we follow the structure of the statement.
Expressions are a little more interesting: They can have sub-expressions.
Parsing expressions
Let's look at arithmetic first. To parse it, we have to think about associativity and precedence. Recall what our grammar looks like:
(* some parts omitted for brevity *)
ExprAdd = ExprMul { ( "+" | "-" ) ExprMul } ;
ExprMul = ExprUnary { ( "*" | "/" | "%" ) ExprUnary } ;
ExprUnary = [ "-" | "!" ] ExprPostfix ;
ExprPostfix = ExprPrimary { ExprCall } ;
ExprPrimary = IDENTIFIER (* ... *) ;
- If a rule A "contains" another rule B, then rule A has lower precedence. For example, + has lower precedence than *.
- In the absence of parentheses, operators with higher precedence are grouped first: a + b * d should parse as a + (b * d).
- Associativity tells us how operators of the same precedence are grouped; for left-associative operators like + and *, the left side takes precedence: a + b + d should parse as (a + b) + d.
enum Expr {
Binary(Box<ExprBinary>),
// ... more expressions
}
struct ExprBinary {
lhs: Expr,
op: BinaryOp,
rhs: Expr,
}
enum BinaryOp {
Add,
Subtract,
// ... other operators
}
fn parse_expr_add(c: &mut Cursor) -> Result<Expr> {
let mut lhs = parse_expr_mul(c)?;
loop {
let op = match c.kind() {
OP_PLUS => BinaryOp::Add,
OP_MINUS => BinaryOp::Subtract,
_ => break,
};
c.advance(); // eat `op`
let rhs = parse_expr_mul(c)?;
lhs = Expr::Binary(Box::new(ExprBinary { lhs, op, rhs }))
}
Ok(lhs)
}
First comes the left side of the expression. Because ExprMul has higher precedence, we'll try to parse it before doing anything else. When we come back, we'll enter a loop. The loop attempts to match one of the valid operators at this precedence level, and converts them to their corresponding BinaryOp variants.
Next, we look at the right side, once again parsing the higher-precedence thing (ExprMul) first, and only then do we accumulate the constructed expression into the lhs variable. If at some point we can't match another operator, we exit the loop.
fn parse_expr_mul(c: &mut Cursor) -> Result<Expr> {
let mut lhs = parse_expr_unary(c)?;
loop {
let op = match c.kind() {
OP_STAR => BinaryOp::Multiply,
OP_SLASH => BinaryOp::Divide,
_ => break,
};
c.advance(); // eat `op`
let rhs = parse_expr_unary(c)?;
lhs = Expr::Binary(Box::new(ExprBinary { lhs, op, rhs }))
}
Ok(lhs)
}
This looks very familiar. We're matching on different operators, and calling parse_expr_unary, climbing higher in precedence.
fn parse_expr_unary(c: &mut Cursor) -> Result<Expr> {
let op = match c.kind() {
OP_MINUS => UnaryOp::Minus,
OP_BANG => UnaryOp::Not,
_ => return parse_expr_postfix(c),
};
c.advance(); // eat `op`
let rhs = parse_expr_unary(c)?;
Ok(Expr::Unary(Box::new(ExprUnary { op, rhs })))
}
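The matching AST nodes aren't shown above; they could look like this (a sketch matching the fields and variants used in parse_expr_unary):
// Sketch of the unary expression nodes.
enum UnaryOp {
    Minus,
    Not,
}

struct ExprUnary {
    op: UnaryOp,
    rhs: Expr,
}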
For unary expressions like -a and !a, the operator comes first, and then its rhs sub-expression, where it loops back around to itself. That's almost infinite recursion, but we have a base case: if we don't match any operators, we call parse_expr_postfix.
fn parse_expr_postfix(c: &mut Cursor) -> Result<Expr> {
let mut expr = parse_expr_primary(c)?;
while c.at(TOK_LPAREN) {
expr = parse_expr_call(c, expr)?;
}
Ok(expr)
}
The parentheses come after the callee, like f(a, b, c).
fn parse_expr_call(c: &mut Cursor, callee: Expr) -> Result<Expr> {
let args = parse_arg_list(c)?;
Ok(Expr::Call(Box::new(ExprCall { callee, args })))
}
fn parse_arg_list(c: &mut Cursor) -> Result<Vec<Expr>> {
let args = parse_paren_list(c, parse_expr)?;
Ok(args)
}
There's parse_paren_list again, but this time each element is an expression.
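The call node itself is equally small (a sketch, matching the fields used above):
// Sketch: a call stores the expression being called and its arguments.
struct ExprCall {
    callee: Expr,
    args: Vec<Expr>,
}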
Our parser's call stack currently looks like this:
parse
parse_program
parse_stmt
parse_stmt_expr
parse_expr
.. many `parse_expr_*` functions
parse_expr_postfix
parse_expr_call
parse_arg_list
After calling parse_paren_list, we'll see another parse_expr stack frame, which can descend all the way to parse_expr_call again, recursively. Our parser can continue going like this for a long time[14]. Through recursion, we naturally handle nested function calls like a(b(c(d(1,2,3)))).
fn parse_expr_primary(c: &mut Cursor) -> Result<Expr> {
match c.kind() {
LIT_INT => parse_expr_int(c),
LIT_STR => parse_expr_str(c),
LIT_IDENT => parse_expr_ident(c),
TOK_LBRACE => parse_expr_block(c),
TOK_LPAREN => parse_expr_group(c),
_ => error(
format!("unexpected token: {:?}", c.kind()),
c.current().span,
)
.into(),
}
}
There are more cases here than anywhere else. Let's look at two of them:
fn parse_expr_str(c: &mut Cursor) -> Result<Expr> {
let token = c.must(LIT_STR)?;
let value = c
.lexeme(token)
.strip_prefix('"')
.expect("string was lexed without opening quote")
.strip_suffix('"')
.expect("string was lexed without closing quote")
.to_owned();
Ok(Expr::Str(Box::new(ExprStr { value })))
}
fn parse_expr_block(c: &mut Cursor) -> Result<Expr> {
let inner = parse_block(c)?;
Ok(Expr::Block(Box::new(ExprBlock { inner })))
}
Strings come from LIT_STR tokens. We retrieve the token's lexeme, which is the whole string including quotes, and strip those quotes. parse_expr_int and parse_expr_ident work similarly: read the lexeme, then parse it into a value type.
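Here's a sketch of what parse_expr_int could look like, assuming the same error helper used elsewhere in the parser:
// Sketch: parse a LIT_INT token's lexeme into an i64.
fn parse_expr_int(c: &mut Cursor) -> Result<Expr> {
    let token = c.must(LIT_INT)?;
    // The lexer guarantees the lexeme is all digits,
    // but the value may still not fit into an i64.
    let Ok(value) = c.lexeme(token).parse::<i64>() else {
        return error(
            format!("integer literal is too large: {}", c.lexeme(token)),
            token.span,
        )
        .into();
    };
    Ok(Expr::Int(Box::new(ExprInt { value })))
}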
Block expressions are wrappers around blocks, re-using the same parsing code as the function body in parse_stmt_fn. Blocks are lists of statements, which is another place where we recurse in our recursive descent. This means we handle not just nested calls, but nested blocks and nested function declarations as well.
Testing our parser
I want to end the article with a note about testing. It's vital that you test your code, otherwise you can't know if it actually works.
Testing parsers the usual way with unit tests can be super tedious. You'll likely need hundreds of tests for simple parsers, and thousands for more complex ones[15]. You should test not only valid syntax, but also which error messages are produced for different kinds of syntax errors.
Imagine having hundreds of these:
#[test]
fn test_parse_stmt_let() {
    let code = "let v = 0;";
    assert_eq!(
        parse(code),
        Ok(Program {
            body: vec![
                Stmt::Let(Box::new(StmtLet {
                    name: "v".into(),
                    value: Expr::Int(Box::new(ExprInt {
                        value: 0,
                    })),
                })),
            ],
            tail: None,
        })
    );
}
Any time you change the parser's behavior, or change the AST in any way, you have to update all of them. That's a lot of manual, error-prone work! Is your test failing because the parser is buggy, or because you forgot to update the expected AST, or updated it incorrectly?
It gets even worse for error messages, which are strings rendered from the source code. It's really easy to make a typo, which can be quite difficult to find in an assertion failure.
Here's my recommendation: Use glob tests with snapshots.
Snapshot testing is the best way to test parsers and compilers. The idea is:
- Parse some code
- Render the result to a string
- On success, you'll render the AST
- On error, render the error message like you'd display it in a terminal
- Store the rendered result in a file
These files should be committed to VCS. You can inspect the file to determine what your parser produces for that input, and whenever the output changes, you can inspect the diff.
You no longer have to update hundreds of tests manually whenever anything changes. More importantly, whenever something does change, you now only have to review the output, and commit once it looks good. You'll still spend a lot of time reading AST diffs, which is unavoidable.
There are many ways to set this kind of thing up. You can either write your own test harness, or use existing tools:
- insta is a magnificent snapshot testing library, and it comes with a nice CLI tool which lets you review and accept diffs interactively.
- glob_test exports a proc macro which generates a test for each file matching a glob pattern. (disclaimer: this is my own library)
Combining the two, our parser tests would look like:
fn parse(code: &str) -> String {
match super::parse(code) {
Ok(ast) => format!("{ast:#?}"),
Err(err) => format!("{}", err.render(code)),
}
}
#[glob_test::glob("./inputs/**/*.simp")]
fn test_parse(path: &std::path::Path) {
let input = std::fs::read_to_string(path).unwrap();
insta::assert_snapshot!(parse(&input));
}
To add a new test, all we have to do is create an input file:
src/parser/tests
|- /inputs
|- /valid
| |- stmt_let.simp
| |- stmt_fn_single_param.simp
| |- stmt_fn_multi_param.simp
| |- stmt_fn_nested_fn.simp
|
|- /invalid
|- stmt_let_missing_eq.simp
|- stmt_fn_missing_paren.simp
And run cargo insta test --review to interactively review the diffs.
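For a hypothetical inputs/valid/stmt_let.simp containing let v = 0;, the stored snapshot would hold roughly this Debug rendering of the AST (insta also adds a small metadata header; the exact shape depends on your node types):
Program {
    body: [
        Let(
            StmtLet {
                name: "v",
                value: Int(
                    ExprInt {
                        value: 0,
                    },
                ),
            },
        ),
    ],
    tail: None,
}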
Any time we change the behavior of our parser, or change the structure of our AST, we can review and update hundreds of tests in minutes.
Homework
And that's it! We can now successfully parse a simp program. There are a few things which we haven't covered, and many potential improvements, which are left as an exercise for you, the reader. That is, if you're interested...
I promise I won't grade your work.
How should we report errors? You may have noticed an err.render function in our testing setup. It seems like I've conveniently left out its implementation... How would you implement it? You're given the source &str and an error type like the following:
struct Error {
message: String,
span: Span,
}
Hint: Error reporting
Try finding the start of the first line in span, and the end of the last line in span. Those two indices may form a new span, which probably covers the entire syntax node that has some kind of issue. Printing those lines with line numbers plus the error message is a really good start! It covers most of what a nice error message should tell the user:
- Which part of their code has an issue
- What the issue is
You could expand errors to also support notes and suggestions. Implementing those well would mean more work for our parser, as now it's no longer enough to just determine what is wrong with a given piece of syntax, but also how the user may want to fix it.
For example, if a function param list is missing a closing paren, you could suggest adding the closing paren.
This can quickly get out of hand. I recommend reading through rustc's parser. Nice error messages have a huge complexity cost!
Can we report more than one error per file? It would be much nicer for the user of our parser if we attempt to recover from syntax errors, and continue parsing to output as many syntax errors as we can at a time. This is a difficult problem, but there are a few simple additions to our parser which would cover many common cases.
This also ties back into error reporting: the signature of our parse function must change to return Result<Program, Vec<Error>>. Maybe that Vec could be encapsulated into something that also has a render function? You decide!
Hint: Error recovery
We can get pretty far by taking advantage of two things:
- If the user is writing at the end of a file, there isn't any point in attempting to recover.
- If the user is writing either at the start or somewhere in the middle of a file, there's a good chance whatever comes after what they just typed contains valid syntax.
If you catch errors coming from parse_stmt, you can then try to recover by consuming tokens (using c.eat) until you find something that either:
- Ends the statement which failed to parse, like a semicolon.
- Begins a new statement which could be valid syntax, like fn, let, if, or an opening brace ({).
There's a balancing act here. If you incorrectly determine the beginning of the next potentially valid syntax node, you can report a false error, which is much worse than reporting no error at all. You should aim to report one syntax error per node, at whatever "node" granularity you choose; recovering just parse_stmt is a fine solution!
Our parser isn't very efficient!
I will admit I have not benchmarked this thing at all. In this case, we can find obvious flaws even without benchmarks. The biggest one is that the AST representation uses a lot of small allocations: each Stmt and Expr variant is wrapped in a Box, each identifier and string is an owned String, etc.
Can you come up with a more efficient way to store the AST? Is there a different way you could allocate memory to amortize the cost of so many allocations, without losing the general structure of the AST?
You should profile the parser before attempting to optimize it. Creating good benchmarks is very much a science, but also a bit of an art. Here are some suggestions:
- Write a large input source file by concatenating different chunks of valid syntax in some random pattern.
- Make the input file very large, at least 50 MiB.
- Try to include as many different syntax nodes in the file as possible, to properly exercise different code paths throughout the parser.
This kind of setup exposes more bottlenecks:
- Your entire file won't fit in CPU cache[16], which surfaces issues with memory locality.
- Modern branch predictors are a little too good at what they do, which can be a nightmare for benchmarking. A CPU can near-perfectly learn a small file's structure, resulting in far fewer branch misses than a realistic workload would see. A typical source file won't consist entirely of the same chunk duplicated hundreds of times, so a corpus built from a variety of snippets produces a more realistic workload.
For implementing your benchmarks, I highly recommend divan.
Hint: Improving performance
One really common fix for the excessive boxing is to allocate memory in an arena, using a library like bumpalo. That means adding a 'bump lifetime to every single AST node. If you also add a 'src lifetime, then the AST can hold references into the input source code &str, and you wouldn't have to allocate at all for identifiers and strings.
Put what you learnt into practice!
Try to add new syntax to simp: where would it fit into the grammar? How would you parse it? What are its edge cases?
Maybe you don't like simp all that much; that's fine! Designing your own language and its syntax from scratch can be really fun. Take inspiration from other languages, or come up with something novel and strange. Read the source code of various open-source parsers to learn their tricks, or brute-force your way to a solution.
Playground
You can try out the parser we built right here:
In case you'd like a bigger editor, it's also hosted as a separate page.
All the code is available on GitHub.
1. You can also parse straight into executable code.
2. Concrete syntax trees are often used in language servers.
3. Any similarity in name to existing programming languages is purely coincidental; I looked it up and didn't find anything.
4. We're going all in on expressions, with no return keyword! This complicates the parser a bit, but also makes it more interesting.
5. Our Span type is like Range<u32>, but it implements Copy. The standard library can't break the Range type, so we're stuck with making our own. It doesn't have first-class syntax, but at least it can still be used to directly index into strings.
6. We could also read one Rust char at a time. We won't allow unicode outside of strings, and spans are easier to produce when dealing with bytes, so we're not doing that.
7. It's possible to write a parser that doesn't stop at anything, and treats errors as part of the syntax tree. Another technique common in language servers!
8. Attempts have been made to standardise it, but nothing seems to have stuck. Different tools accept different syntaxes, and research papers are similarly inconsistent.
9. I cannot recommend this book enough. The web version was my entrypoint into programming languages.
10. An identifier is a name given to some construct, like a function or variable.
11. A string is a sequence of characters, in our case wrapped in double quotes. They're used to store arbitrary text directly in source code.
12. Meaning there is no return keyword which tells the compiler that you want a specific value to be the return value.
13. These may appear in a few places, such as function calls (f(a, b, c,)) and parameter lists (fn f(a, b, c,) {}).
14. Until we hit a stack overflow. There are ways to solve that: we could manually allocate our own stack before entering the parser, and grow it when we are about to run out of space.
15. I counted 23866 .rs files in the tests directory of rustc as of 425a9c0. I don't fully understand the testing architecture, but it's safe to say there are tens of thousands of tests.
16. For most processors, anyway. Some AMD chips now have that much cache, which is kind of terrifying.