
A practical introduction to parsing


If you're reading this, you probably know how to program a computer. You write some text, and use a special thing called a "compiler" (sometimes also "interpreter") to turn that text into instructions for your computer.

I want to attempt to explain a small part of the process of transforming text into those instructions, usually referred to as "parsing", by implementing a parser in Rust in an opinionated way.

Parsing programming languages

As a programmer, you intuitively understand how to recognize different parts of a programming language. It's similar to reading any other form of text. You read it one line at a time, or by scanning for patterns. Each chunk is given some kind of meaning based on its contents.

But a computer doesn't know how to do that without your help. It's your job to break up the text into smaller pieces, the individual syntactic elements of a program. You also have to assign each piece of syntax some sort of meaning. The program which does this is usually called a parser, and is often only one stage in a full compiler.

The parser reads source code, and transforms it into a syntax tree1. Depending on what exactly you're trying to do, this syntax tree can either be concrete or abstract.

A concrete syntax tree (CST) stores everything necessary to preserve the exact appearance of a program: Parentheses, curly brackets, keywords, and so on. An abstract syntax tree (AST) only stores whatever is necessary to preserve the program's meaning, discarding everything else. We'll be using an AST2.

A simple language

Our testbed for parsing will be a simple programming language, called simp3.

Here's a simple program:

fn factorial(n) {
  if n <= 1 { n }
  else { n * factorial(n - 1) }
}

let fact_5 = factorial(5);
assert(fact_5 == 120, "5! must be 120");

It's so simple that it's just barely useful. We have:

  • functions (fn) and function calls
  • let bindings
  • if/else expressions and blocks
  • integers, strings, and the usual arithmetic, comparison, and logical operators

And not much more. The syntax is heavily inspired by Rust; it's vaguely C-like with curly brackets and semicolons, is "expression-based"4, declares functions with fn, and so on.

We have some fancy structured text. What now?

Lexical analysis

To make the process of parsing easier, source text is typically first split up into a list of tokens, referred to as "tokenization" or "lexing" (short for lexical analysis). It's not strictly necessary, but it is useful; our parser becomes a lot simpler by matching on tokens instead of string slices or individual characters.

Each token will need to store what kind of token it is, and a span5. We'll use the span to retrieve the token's lexeme -- its slice of the source code.

struct Token {
    kind: TokenKind,
    span: Span,
}
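
Here's a minimal sketch of what Span might look like: essentially a Copy-able Range<u32> (see the footnote), with a conversion from the Range<usize> positions our lexer produces.

#[derive(Clone, Copy, Debug)]
struct Span {
    start: u32,
    end: u32,
}

impl Span {
    fn empty() -> Self {
        Span { start: 0, end: 0 }
    }
}

impl From<std::ops::Range<usize>> for Span {
    fn from(range: std::ops::Range<usize>) -> Self {
        Span {
            start: range.start as u32,
            end: range.end as u32,
        }
    }
}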

For brevity, some code snippets are incomplete; the full source code is available on GitHub.

The core of our lexer is a loop, reading one byte6 at a time from our source text. It'll also need to keep track of where in the source code it is.

fn lex(code: &str) -> Vec<Token> {
    let bytes = code.as_bytes();
    let mut pos = 0;
    while pos < bytes.len() {
        let start = pos; // token's starting position
        // ...
    }
}

Next we need to determine if we're looking at a character which is valid in this position. Rust's pattern matching is convenient for this task.

To match one-byte tokens, such as + and -, we only need to match the first byte:

match bytes[pos] {
    b'+' => { /* ... */ }
    b'-' => { /* ... */ }
    // ...
}

For two-byte tokens, we'll need a nested match:

match bytes[pos] {
    // ...
    b'=' => if bytes.get(pos+1) == Some(&b'=') {
        // matched `==`
        pos += 2;
    } else {
        // matched only `=`
        pos += 1;
    }
}

Our final token "category" is an arbitrary-length sequence. Our lexer is greedy, meaning it continues to "append" characters to the current token for as long as it's a valid sequence. We'll use an inner loop for that:

match bytes[pos] {
    // ...

    // identifier
    b'_' | b'a'..=b'z' | b'A'..=b'Z' => {
        pos += 1;

        while let Some(byte) = bytes.get(pos)
            && let b'_' | b'a'..=b'z' | b'A'..=b'Z' = *byte
        {
            pos += 1;
        }

        // ...
    }
}

simp also has a few keywords. After we lex an identifier, we'll have to check if it's a keyword, and output the token kind for that keyword instead.
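
That check can be a simple match on the identifier's lexeme (a sketch; the KW_* and LIT_IDENT token kinds are the ones we'll define in the next section):

let lexeme = &code[start..pos];
let kind = match lexeme {
    "fn" => KW_FN,
    "if" => KW_IF,
    "else" => KW_ELSE,
    "let" => KW_LET,
    // anything else stays an identifier
    _ => LIT_IDENT,
};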

Once we've "consumed" some valid sequence of characters, we can construct a token:

let end = pos;
Token {
    kind: /*...*/,
    span: Span::from(start..end),
}

Lexical analysis, automated

While I'd be happy to write a lexer by hand, it tends to be a very error-prone, tedious task. Our needs are very basic, so to make our job easier, we'll use a library called Logos. It's easy to use, and generates very fast lexers, the kind which would be very impractical to write by hand.

I intentionally omitted the definition of our TokenKind enum until now. Here it is:

#[derive(Logos)]
#[logos(skip r"[ \t\n\r]+")]
enum TokenKind {
    #[token("fn")] KW_FN,
    #[token("if")] KW_IF,
    #[token("else")] KW_ELSE,
    #[token("let")] KW_LET,

    #[token(";")] TOK_SEMI,
    #[token("(")] TOK_LPAREN,
    #[token(")")] TOK_RPAREN,
    #[token("{")] TOK_LBRACE,
    #[token("}")] TOK_RBRACE,
    #[token(",")] TOK_COMMA,

    #[token("-")] OP_MINUS,
    #[token("+")] OP_PLUS,
    #[token("*")] OP_STAR,
    #[token("/")] OP_SLASH,
    #[token("=")] OP_EQ,
    #[token("||")] OP_OR,
    #[token("&&")] OP_AND,
    #[token("==")] OP_EQEQ,
    #[token("!=")] OP_NEQ,
    #[token("<")] OP_LT,
    #[token("<=")] OP_LE,
    #[token(">")] OP_GT,
    #[token(">=")] OP_GE,
    #[token("!")] OP_BANG,

    #[regex(r"[0-9]+")]
    LIT_INT,

    #[regex(r#""([^"\\]|\\.)*""#)]
    LIT_STR,

    #[regex(r"[a-zA-Z_][a-zA-Z_0-9]*")]
    LIT_IDENT,

    TOK_ERROR,
    TOK_EOF,
}

Logos generates a lexer for us using the annotations on this enum. We have tokens for keywords, punctuation characters, operators, and "literals" identified by a regular expression. We also have two special tokens:

  • TOK_ERROR, which we emit whenever the lexer fails to recognize a piece of input.
  • TOK_EOF, which we append to the end of the token list to mark the end of the file.

If you're curious about what it expands to, here it is in a gist.

Aside: Storing data in TokenKind

Our TokenKind variants don't carry any data. It is possible to get Logos to parse integers for us, and store the resulting value inside the TokenKind enum. We're not going to use that feature. What we want is for the lexer to only recognize tokens, and to continue even when encountering an error. We'll be much better positioned to report errors in the parser, so we'll process the token's lexeme there, turning a LIT_INT into an i64, for example.

It's also a bit more complicated to match on tokens when they may carry values. In my experience, doing it this way leads to leaner, faster code.


The enum variant naming convention may seem strange at first, but this is something I picked up from reading the source code of rust-analyzer.

The idea is to name each token kind something globally unique, and then use a glob import when using them. Using all-uppercase plus prefixes like KW_ and TOK_ for "groups" of token kinds works well for that.

Unlike rust-analyzer, we use token prefixes instead of suffixes. That way the auto-completion list is much shorter; If you want to match on some keyword, type KW_, and you'll only see the subset of token kinds which are keywords.
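
In code, the convention looks something like this (the is_keyword helper is just for illustration):

use TokenKind::*;

// Prefixed, globally unique names make matching on whole groups easy:
fn is_keyword(kind: TokenKind) -> bool {
    matches!(kind, KW_FN | KW_IF | KW_ELSE | KW_LET)
}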

Now we can "write" our lexer:

fn lex(src: &str) -> Vec<Token> {
    TokenKind::lexer(src)
        .spanned()
        .map(|item| match item {
            (Ok(kind), span) => Token {
                kind,
                span: span.into(),
            },
            (Err(()), span) => Token {
                kind: TOK_ERROR,
                span: span.into(),
            },
        })
        .chain([Token {
            kind: TOK_EOF,
            span: Span::empty(),
        }])
        .collect()
}

Logos is doing a lot of heavy lifting here. An equivalent lexer would be a few hundred lines of code, and it wouldn't be nearly as fast. At this point, the only reason this function exists is to preprocess what Logos outputs into a slightly nicer form.

Parser == grammar police

Before we talk about parsing, we still need to talk about what we'll be parsing. Human languages have grammar, and so do programming languages. Just like a school teacher, our parser will only accept what is grammatically valid7. For anything else, it will output syntax errors.

We know what simp syntax looks like, but what does its grammar look like? To describe the grammar of human language, we often use more human language to do it. But that won't work for computers.

To describe a programming language grammar, we'll use a notation called Extended Backus-Naur form (EBNF). The E in EBNF effectively means we can do whatever we want, because there is no single8 standard!

Here's what simp's formal grammar looks like:

Program = { Stmt } [ Expr ] ;

Stmt = StmtFn | StmtLet | StmtExpr ;

StmtFn = "fn" Identifier "(" [ ParamList ] ")" Block ;
ParamList = Identifier { "," Identifier } [ "," ] ;

StmtLet = "let" Identifier "=" Expr ";" ;

StmtExpr = Expr ";" ;

Expr = ExprIf | ExprOr ;

ExprIf = "if" Expr Block "else" Block ;

ExprOr = ExprAnd { "||" ExprAnd } ;
ExprAnd = ExprEq { "&&" ExprEq } ;
ExprEq = ExprOrd { ( "==" | "!=" ) ExprOrd } ;
ExprOrd = ExprAdd { ( "<" | "<=" | ">" | ">=" ) ExprAdd } ;
ExprAdd = ExprMul { ( "+" | "-" ) ExprMul } ;
ExprMul = ExprUnary { ( "*" | "/" | "%" ) ExprUnary } ;
ExprUnary = [ "-" | "!" ] ExprPostfix ;
ExprPostfix = ExprPrimary { ExprCall } ;
ExprCall = "(" [ ArgList ] ")" ;
ArgList = Expr { "," Expr } [ "," ] ;

ExprPrimary = INTEGER
        | STRING
        | IDENTIFIER
        | Block
        | "(" Expr ")"
        ;

Block = "{" { Stmt } [ Expr ] "}" ;

To break it down:

  • Quoted strings like "fn" and UPPERCASE names like IDENTIFIER are terminals.
  • Everything else (Program, Stmt, Expr, ...) is a nonterminal, defined in terms of other symbols.
  • { X } means "zero or more of X", [ X ] means "X is optional", and X | Y means "either X or Y".

It's useful to think of nonterminals as functions which can be called to parse a specific sequence of other symbols. Terminals are tokens, and are indivisible.

We have two main categories of syntax: statements and expressions.

“Where an expression's main job is to produce a value, a statement's job is to produce an effect.”

Bob Nystrom in his book Crafting Interpreters9

Some parts of the grammar are omitted for brevity, such as identifiers10 and strings11. They tend to expand into quite a few rules, which doesn't really help us here, so we treat them as terminals.

We also have a special case in the parser for a trailing expression in a block of code, which is the implicit return value12 of that block. That case is specified as { Stmt } [ Expr ] in the grammar, but it isn't quite that simple in reality.

And finally, we do support trailing commas13, but the grammar doesn't reflect how they are handled.

A simple parser

The purpose of a parser is:

Given a valid sequence of tokens, produce a syntax tree.

There are many ways to do this. It's a whole field of study, one we're not going to delve very deeply into right now.

The technique I've chosen for this parser is called recursive descent. It's simple, very common, and very effective. Another common technique is called Pratt parsing. Sometimes multiple different kinds of parsing techniques are combined together into a single super-parser.

Our parser will read the token list, one token at a time, and traverse the grammar. For each rule we traverse, we'll either successfully match it and build a syntax node, or produce an error.

In most of our parser, all we'll have to do is match on the current token, and advance to the next position if we find the right one.

To help us with that, we'll need a cursor to keep track of which token the parser is on:

struct Cursor<'src> {
    code: &'src str,
    tokens: Vec<Token>,
    position: usize,
}

And a few methods on the cursor to work with our token list:

impl<'src> Cursor<'src> {
    /// Advance the cursor to the next token.
    fn advance(&mut self) {
        if self.position + 1 >= self.tokens.len() {
            return;
        }

        self.position += 1;
    }

    /// Returns the token under the cursor.
    fn current(&self) -> Token {
        self.tokens[self.position]
    }

    /// Returns the token before the cursor.
    fn previous(&self) -> Token {
        self.tokens[self.position - 1]
    }

    /// Returns the current token kind,
    /// shorthand for `c.current().kind`.
    fn kind(&self) -> TokenKind {
        self.current().kind
    }

    /// Returns `true` if the current token matches `kind`.
    fn at(&self, kind: TokenKind) -> bool {
        self.current().kind == kind
    }

    /// Returns `true` if the previous token matched `kind`.
    fn was(&self, kind: TokenKind) -> bool {
        self.position > 0 && self.previous().kind == kind
    }

    /// Returns `true` and advances
    /// if the current token matches `kind`.
    fn eat(&mut self, kind: TokenKind) -> bool {
        if self.at(kind) {
            self.advance();
            true
        } else {
            false
        }
    }

    /// Returns the current token if it matches `kind`,
    /// otherwise returns an error.
    fn must(&mut self, kind: TokenKind) -> Result<Token> {
        let current = self.current();
        if self.eat(kind) {
            Ok(current)
        } else {
            error(
                format!(
                    "expected {kind:?}, found {:?}",
                    current.kind,
                ),
                current.span,
            )
            .into()
        }
    }
}
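
We'll also rely on one more helper later on: lexeme, which slices the source code using a token's span. A minimal sketch, assuming Span stores start and end byte offsets:

impl<'src> Cursor<'src> {
    /// Returns the slice of source code covered by `token`.
    fn lexeme(&self, token: Token) -> &'src str {
        &self.code[token.span.start as usize..token.span.end as usize]
    }
}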

Syntax trees and recursive descent

The entrypoint for our parser will be a parse function. It turns a string into a program, or a syntax error.

fn parse(code: &str) -> Result<Program> {
    let c = &mut Cursor {
        code,
        tokens: lex(code),
        position: 0,
    };

    parse_program(c)
}

The process of hand-writing a recursive descent parser involves closely following the grammar, implementing it one nonterminal at a time. We also need to define each kind of syntax node as we go along.

Our entrypoint calls into parse_program. Starting with its syntax tree node:

struct Program {
    body: Vec<Stmt>,
    tail: Option<Expr>,
}

A program is a list of statements, optionally followed by a trailing expression. We'll continue parsing statements until we reach the end of the file.

// Program = { Stmt } [ Expr ] ;
fn parse_program(c: &mut Cursor) -> Result<Program> {
    let mut body = Vec::new();
    while !c.at(TOK_EOF) {
        body.push(parse_stmt(c)?);
    }

    // ...
}
Aside: Trailing expressions

We're trying to mimic Rust, which can be a bit tricky. A block of code can have a trailing expression. That happens when it ends with an expression, and that expression isn't terminated by a semicolon:

{ { 1 } { 2 } } // valid syntax, implicit return value is `2`

Developing parsers (and compilers in general) often involves thinking about contrived examples like these. Parsers tend to have a lot of emergent behavior. It's your job to rein it in!

You can add parser complexity to allow users of your language to write fewer characters. You can also keep things more verbose and explicit, and make your own life as a compiler developer easier. A verbose feature which solves a problem is still probably better than not solving the problem at all. Sometimes there's a clever way to achieve both fewer characters and a simpler compiler.

In this case, we're adding parser complexity in exchange for the ability to write code like this:

{
    let a = 1;

    {
        // more tightly scoped thing
        let secret = 42;
        foo(a + secret)
    } // no semicolon, despite being an expression

    a // correctly parsed as the trailing expression for this block
}

Because we don't want to require block-like expressions be terminated by a semicolon, we need to do a few things:

  • Any statement which requires a semicolon, such as let, must consume it.
  • Any non-trailing expression in statement position must require a semicolon.

We can determine if an expression in statement position is trailing by checking what comes after it:

  • EOF - it's the end of the file, so it's trailing.
  • } - it's the end of a block, also trailing.
  • Anything else must be another statement or expression, so we require a semicolon to separate them.

This is a consequence of how simp's grammar is laid out. It shows why writing a (mostly) formal grammar is useful! You can reason about the entirety of your syntax, and resolve ambiguities.
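
Here's a sketch of one way to implement those rules (not necessarily how the full implementation on GitHub does it; it assumes the Expr enum has Block and If variants, as the grammar suggests):

fn parse_stmt_expr(c: &mut Cursor) -> Result<Stmt> {
    let expr = parse_expr(c)?;
    // Block-like expressions never need a semicolon; everything else
    // needs one unless it's in trailing position.
    let block_like = matches!(expr, Expr::Block(_) | Expr::If(_));
    let trailing = c.at(TOK_RBRACE) || c.at(TOK_EOF);
    if !block_like && !trailing {
        c.must(TOK_SEMI)?;
    } else {
        // A semicolon is still allowed here, just not required.
        c.eat(TOK_SEMI);
    }
    Ok(Stmt::Expr(Box::new(expr)))
}

// Back in `parse_block` (and `parse_program`): if the last statement was an
// expression and no semicolon followed it, it becomes the trailing expression.
let tail = match body.last() {
    Some(Stmt::Expr(_)) if !c.was(TOK_SEMI) => match body.pop() {
        Some(Stmt::Expr(expr)) => Some(*expr),
        _ => unreachable!(),
    },
    _ => None,
};
Ok(Block { body, tail })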


A statement is either a function, a variable, or an expression. To determine what we're looking at, we need to match on the current token:

// Stmt = StmtFn | StmtLet | StmtExpr ;

enum Stmt {
    Fn(Box<StmtFn>),
    Let(Box<StmtLet>),
    Expr(Box<Expr>),
}

fn parse_stmt(c: &mut Cursor) -> Result<Stmt> {
    match c.kind() {
        KW_FN => parse_stmt_fn(c),
        KW_LET => parse_stmt_let(c),
        _ => parse_stmt_expr(c),
    }
}

Our AST is recursive; an Expr can contain other Exprs. Without boxing, the types would be infinitely recursive. rustc would not be happy about that:

error[E0072]: recursive types `ast::Stmt` and `ast::StmtFn` have infinite size
  --> src/ast.rs:11:1
   |
11 | pub enum Stmt {
   | ^^^^^^^^^^^^^
12 |     Fn(StmtFn),
   |        ------ recursive without indirection
Aside: Boxing enum variants

Instead of boxing enum variants, we could box Expr and Stmt fields in each syntax node, which would also prevent the type from being infinitely recursive.

Enums in Rust are always at least as large as their largest variant, and that's the problem with this alternate approach: a sizeable chunk of the memory allocated for our syntax nodes would be wasted.

By boxing variants like we do above, each variant's payload is just 8 bytes (a pointer), and each Box allocation is exactly sized to fit the fields of a given syntax node, wasting much less memory overall.


A function has a name, a list of parameters, and a body.

struct StmtFn {
    name: String,
    params: Vec<String>,
    body: Block,
}

fn parse_stmt_fn(c: &mut Cursor) -> Result<Stmt> {
    assert!(c.eat(KW_FN));    
    let name = parse_ident(c)?;
    let params = parse_param_list(c)?;
    let body = parse_block(c)?;
    Ok(Stmt::Fn(Box::new(StmtFn { name, params, body })))
}

parse_stmt_fn can expect the cursor to be on top of a fn keyword, so we assert on it. We're already matching on the keyword in parse_stmt, so a failure to eat KW_FN would be our fault, not the user's. Rule of thumb:

Tokens should be consumed by the node which they belong to.

Identifiers are terminal symbols. To parse one, we must retrieve the lexeme of a LIT_IDENT token:

fn parse_ident(c: &mut Cursor) -> Result<String> {
    let token = c.must(LIT_IDENT)?;
    Ok(c.lexeme(token).to_owned())
}

Parameters are a parenthesized, comma-separated list of identifiers:

fn parse_param_list(c: &mut Cursor) -> Result<Vec<String>> {
    let mut list = Vec::new();
    c.must(TOK_LPAREN)?;
    // stop early if we have an empty list: `()`
    if !c.at(TOK_RPAREN) {
        loop {
            list.push(parse_ident(c)?);
            // stop if there's no comma,
            // or if the comma is trailing.
            if !c.eat(TOK_COMMA) || c.at(TOK_RPAREN) {
                break;
            }
        }
    }
    c.must(TOK_RPAREN)?;
    Ok(list)
}

It's a little more involved, because we also want to allow trailing commas. We'll actually want to re-use that code later for argument lists, so let's break it out right now:

fn parse_paren_list<F, T>(
    c: &mut Cursor,
    mut elem: F,
) -> Result<Vec<T>>
where
    F: FnMut(&mut Cursor) -> Result<T>,
{
    let mut list = Vec::new();
    c.must(TOK_LPAREN)?;
    if !c.at(TOK_RPAREN) {
        loop {
            list.push(elem(c)?);
            if !c.eat(TOK_COMMA) || c.at(TOK_RPAREN) {
                break;
            }
        }
    }
    c.must(TOK_RPAREN)?;
    Ok(list)
}

And use that instead in parse_param_list:

fn parse_param_list(c: &mut Cursor) -> Result<Vec<String>> {
    parse_paren_list(c, parse_ident)
}

A function's body is a block. It has an identical definition to Program:

struct Block {
    body: Vec<Stmt>,
    tail: Option<Expr>, // see aside about trailing expressions
}

And its implementation is largely the same, except that now we also require braces to wrap the contents:

fn parse_block(c: &mut Cursor) -> Result<Block> {
    c.must(TOK_LBRACE)?;
    let mut body = Vec::new();
    while !c.at(TOK_RBRACE) {
        body.push(parse_stmt(c)?);
    }
    c.must(TOK_RBRACE)?;

    // ...
}

Statements tend to be the easiest kind of thing to parse, because they usually begin with a keyword or unique symbol. That gives us a clear starting point, from which we follow the structure of the statement.

Expressions are a little more interesting: They can have sub-expressions.

Parsing expressions

Let's look at arithmetic first. To parse it, we have to think about associativity and precedence. Recall what our grammar looks like:

(* some parts omitted for brevity *)

ExprAdd = ExprMul { ( "+" | "-" ) ExprMul } ;
ExprMul = ExprUnary { ( "*" | "/" | "%" ) ExprUnary } ;
ExprUnary = [ "-" | "!" ] ExprPostfix ;
ExprPostfix = ExprPrimary { ExprCall } ;

ExprPrimary = IDENTIFIER ;

Here are the syntax nodes for binary expressions, along with parse_expr_add:

enum Expr {
    Binary(Box<ExprBinary>),
    // ... more expressions
}

struct ExprBinary {
    lhs: Expr,
    op: BinaryOp,
    rhs: Expr,
}

enum BinaryOp {
    Add,
    Subtract,
    // ... other operators
}

fn parse_expr_add(c: &mut Cursor) -> Result<Expr> {
    let mut lhs = parse_expr_mul(c)?;
    loop {
        let op = match c.kind() {
            OP_PLUS => BinaryOp::Add,
            OP_MINUS => BinaryOp::Subtract,
            _ => break,
        };
        c.advance(); // eat `op`
        let rhs = parse_expr_mul(c)?;
        lhs = Expr::Binary(Box::new(ExprBinary { lhs, op, rhs }))
    }
    Ok(lhs)
}

First comes the left side of the expression. Because ExprMul has higher precedence, we'll try to parse it before doing anything else.

When we come back, we'll enter a loop. The loop attempts to match one of the valid operators at this precedence level, and converts them to their corresponding BinaryOp variants.

Next, we look at the right side. Once again parsing the higher precedence thing (ExprMul) first, and only then do we accumulate the constructed expression into the lhs variable.

If at some point we can't match another operator, we exit the loop.

fn parse_expr_mul(c: &mut Cursor) -> Result<Expr> {
    let mut lhs = parse_expr_unary(c)?;
    loop {
        let op = match c.kind() {
            OP_STAR => BinaryOp::Multiply,
            OP_SLASH => BinaryOp::Divide,
            _ => break,
        };
        c.advance(); // eat `op`
        let rhs = parse_expr_unary(c)?;
        lhs = Expr::Binary(Box::new(ExprBinary { lhs, op, rhs }))
    }
    Ok(lhs)
}

This looks very familiar. We're matching on different operators, and calling parse_expr_unary, climbing higher in precedence.

fn parse_expr_unary(c: &mut Cursor) -> Result<Expr> {
    let op = match c.kind() {
        OP_MINUS => UnaryOp::Minus,
        OP_BANG => UnaryOp::Not,
        _ => return parse_expr_postfix(c),
    };
    c.advance(); // eat `op`
    let rhs = parse_expr_unary(c)?;
    Ok(Expr::Unary(Box::new(ExprUnary { op, rhs })))
}

For unary expressions like -a and !a, the operator comes first, followed by its rhs sub-expression, which loops right back into parse_expr_unary. That's almost infinite recursion, but we have a base case: if we don't match any operators, we call parse_expr_postfix.

fn parse_expr_postfix(c: &mut Cursor) -> Result<Expr> {
    let mut expr = parse_expr_primary(c)?;
    while c.at(TOK_LPAREN) {
        expr = parse_expr_call(c, expr)?;
    }
    Ok(expr)
}

The parentheses come after the callee, like f(a, b, c).

fn parse_expr_call(c: &mut Cursor, callee: Expr) -> Result<Expr> {
    let args = parse_arg_list(c)?;
    Ok(Expr::Call(Box::new(ExprCall { callee, args })))
}

fn parse_arg_list(c: &mut Cursor) -> Result<Vec<Expr>> {
    let args = parse_paren_list(c, parse_expr)?;
    Ok(args)
}

There's parse_paren_list again, but this time each element is an expression.

Our parser's call stack currently looks like this:

parse
parse_program
parse_stmt
parse_stmt_expr
parse_expr
.. many `parse_expr_*` functions
parse_expr_postfix
parse_expr_call
parse_arg_list

After calling parse_paren_list, we'll see another parse_expr stack frame, which can descend all the way to parse_expr_call again, recursively.

Our parser can continue going like this for a long time14. Through recursion, we can naturally handle nested function calls like a(b(c(d(1,2,3)))).

fn parse_expr_primary(c: &mut Cursor) -> Result<Expr> {
    match c.kind() {
        LIT_INT => parse_expr_int(c),
        LIT_STR => parse_expr_str(c),
        LIT_IDENT => parse_expr_ident(c),
        TOK_LBRACE => parse_expr_block(c),
        TOK_LPAREN => parse_expr_group(c),
        _ => error(
            format!("unexpected token: {:?}", c.kind()),
            c.current().span,
        )
        .into(),
    }
}

There are more cases here than anywhere else. Let's look at two of them:

fn parse_expr_str(c: &mut Cursor) -> Result<Expr> {
    let token = c.must(LIT_STR)?;
    let value = c
        .lexeme(token)
        .strip_prefix('"')
        .expect("string was lexed without opening quote")
        .strip_suffix('"')
        .expect("string was lexed without closing quote")
        .to_owned();
    Ok(Expr::Str(Box::new(ExprStr { value })))
}

fn parse_expr_block(c: &mut Cursor) -> Result<Expr> {
    let inner = parse_block(c)?;
    Ok(Expr::Block(Box::new(ExprBlock { inner })))
}

Strings come from LIT_STR tokens. We retrieve the token's lexeme, which is the whole string including quotes, and strip those quotes. parse_expr_int and parse_expr_ident work similarly, reading the lexeme, and then parsing it into a value type.
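
For example, parse_expr_int might look something like this (a sketch; the exact error message is made up):

fn parse_expr_int(c: &mut Cursor) -> Result<Expr> {
    let token = c.must(LIT_INT)?;
    // The lexer only accepts ASCII digits here, but the value may still
    // overflow an i64, which we report as a syntax error.
    let value: i64 = match c.lexeme(token).parse() {
        Ok(value) => value,
        Err(_) => {
            return error(
                "integer literal is too large".to_string(),
                token.span,
            )
            .into()
        }
    };
    Ok(Expr::Int(Box::new(ExprInt { value })))
}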

Block expressions are wrappers around blocks, re-using the same parsing code as the function body in parse_stmt_fn. Blocks are lists of statements, which is another place where we recurse in our recursive descent.

This means we handle not just nested calls, but nested blocks and nested function declarations as well.

Testing our parser

I want to end the article with a note about testing. It's vital that you test your code, otherwise you can't know if it actually works.

Testing parsers the usual way with unit tests can be super tedious. You'll likely need hundreds of tests for simple parsers, and thousands for more complex ones15. You should test not only valid syntax, but also which error messages are produced for different kinds of syntax errors.

Imagine having hundreds of these:

#[test]
fn test_parse_stmt_let() {
    let code = "let v = 0;";
    assert_eq!(
        parse(code),
        Ok(Program {
            body: vec![
                Stmt::Let(Box::new(StmtLet {
                    name: "v".into(),
                    value: Expr::Int(Box::new(ExprInt {
                        value: 0,
                    }))
                }))
            ],
            tail: None,
        })
    );
}

Any time you change the parser's behavior, or change the AST in any way, you have to update all of them. That's a lot of manual, error-prone work! Is your test failing because the parser is buggy, or because you forgot to update the expected AST, or updated it incorrectly?

It gets even worse for error messages, which are strings rendered from the source code. It's really easy to make a typo, which can be quite difficult to find in an assertion failure.

Here's my recommendation: Use glob tests with snapshots.

Snapshot testing is the best way to test parsers and compilers. The idea is:

  • Feed each test input to the parser.
  • Serialize whatever it outputs (a syntax tree, or an error message) into a string.
  • Store that string in a snapshot file, and compare every subsequent run's output against it.

These files should be committed to VCS. You can inspect the file to determine what your parser produces for that input, and whenever the output changes, you can inspect the diff.

You no longer have to update hundreds of tests manually whenever anything changes. More importantly, whenever something does change, you now only have to review the output, and commit once it looks good. You'll still spend a lot of time reading AST diffs, which is unavoidable.

There are many ways to set this kind of thing up. You can either write your own test harness, or use existing tools:

  • insta, for snapshot assertions and interactively reviewing diffs.
  • glob_test, for generating one test case per input file matching a glob pattern.

Combining the two, our parser tests would look like:

fn parse(code: &str) -> String {
    match super::parse(code) {
        Ok(ast) => format!("{ast:#?}"),
        Err(err) => format!("{}", err.render(code)),
    }
}

#[glob_test::glob("./inputs/**/*.simp")]
fn test_parse(path: &std::path::Path) {
    let input = std::fs::read_to_string(path).unwrap();
    insta::assert_snapshot!(parse(&input));
}

To add a new test, all we have to do is create an input file:

src/parser/tests
|- /inputs
   |- /valid
   |  |- stmt_let.simp
   |  |- stmt_fn_single_param.simp
   |  |- stmt_fn_multi_param.simp
   |  |- stmt_fn_nested_fn.simp
   |
   |- /invalid
      |- stmt_let_missing_eq.simp
      |- stmt_fn_missing_paren.simp

And run cargo insta test --review to interactively review the diffs.

Any time we change the behavior of our parser, or change the structure of our AST, we can review and update hundreds of tests in minutes.

Homework

And that's it! We can now successfully parse a simp program. There are a few things which we haven't covered, and many potential improvements, which are left as an exercise for you, the reader. That is, if you're interested... I promise I won't grade your work.

How should we report errors? You may have noticed an err.render function in our testing setup. It seems like I've conveniently left out its implementation... How would you implement it? You're given the source &str and an error type like the following:

struct Error {
    message: String,
    span: Span,
}
Hint: Error reporting

Try finding the start of the first line in span, and the end of the last line in span. Those two indices may form a new span, which probably covers the entire syntax node that has some kind of issue. Printing those lines with line numbers, plus the error message, is a really good start! It covers most of what a nice error message should tell the user:

  • Which part of their code has an issue
  • What the issue is
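
A minimal sketch along those lines (the output format is just one option):

impl Error {
    fn render(&self, code: &str) -> String {
        let (start, end) = (self.span.start as usize, self.span.end as usize);
        // Widen the span to cover whole lines.
        let line_start = code[..start].rfind('\n').map(|i| i + 1).unwrap_or(0);
        let line_end = code[end..].find('\n').map(|i| end + i).unwrap_or(code.len());
        // 1-based number of the first line we'll print.
        let first_line = code[..line_start].matches('\n').count() + 1;

        let mut out = format!("error: {}\n", self.message);
        for (offset, line) in code[line_start..line_end].lines().enumerate() {
            out += &format!("{:>4} | {}\n", first_line + offset, line);
        }
        out
    }
}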

You could expand errors to also support notes and suggestions. Implementing those well would mean more work for our parser, as now it's no longer enough to just determine what is wrong with a given piece of syntax, but also how the user may want to fix it.

For example, if a function param list is missing a closing paren, you could suggest adding the closing paren. This can quickly get out of hand. I recommend reading through rustc's parser. Nice error messages have a huge complexity cost!


Can we report more than one error per file? It would be much nicer for the user of our parser if we attempt to recover from syntax errors, and continue parsing to output as many syntax errors as we can at a time. This is a difficult problem, but there are a few simple additions to our parser which would cover many common cases.

This also ties back into error reporting: The signature of our parse function must change to return Result<Program, Vec<Error>>. Maybe that Vec could be encapsulated into something that also has a render function? You decide!

Hint: Error recovery

We can get pretty far by taking advantage of two things:

  • If the user is writing to the end of a file, there isn't any point in attempting to recover.
  • If the user is writing either at the start or somewhere in the middle of a file, there's a good chance whatever comes after what they just typed contains valid syntax.

If you catch errors coming from parse_stmt, you can then try to recover by consuming tokens (using c.eat) until you find something that either:

  • Ends the statement which failed to parse, like a semicolon.
  • Begins a new statement which could be valid syntax, like fn, let, if, or an opening brace ({).

There's a balancing act here. If you incorrectly determine the beginning of the next potentially valid syntax node, you can report a false error, which is much worse than reporting no error at all. You should aim to report one syntax error per node, at whatever "node" granularity you choose; recovering only at the parse_stmt level is a fine solution!
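
Here's a sketch of what recovery at statement granularity might look like, with the signature change mentioned above (the synchronization set is a judgment call, and the trailing-expression handling is omitted):

fn parse_program(c: &mut Cursor) -> Result<Program, Vec<Error>> {
    let mut errors = Vec::new();
    let mut body = Vec::new();
    while !c.at(TOK_EOF) {
        match parse_stmt(c) {
            Ok(stmt) => body.push(stmt),
            Err(err) => {
                errors.push(err);
                // Skip tokens until something that looks like a statement
                // boundary: a semicolon ends the broken statement, while
                // `fn`, `let`, `if`, or `{` likely begin a new one.
                while !c.at(TOK_EOF)
                    && !c.eat(TOK_SEMI)
                    && !matches!(c.kind(), KW_FN | KW_LET | KW_IF | TOK_LBRACE)
                {
                    c.advance();
                }
            }
        }
    }

    if errors.is_empty() {
        Ok(Program { body, tail: None })
    } else {
        Err(errors)
    }
}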


Our parser isn't very efficient! I will admit I have not benchmarked this thing at all. In this case, we can find obvious flaws even without benchmarks. The biggest one being that the AST representation uses a lot of small allocations: Each Stmt and Expr variant is wrapped in a Box, each identifier and string is an owned String, etc.

Can you come up with a more efficient way to store the AST? Is there a different way you could allocate memory to amortize the cost of so many allocations, without losing the general structure of the AST?

You should profile the parser before attempting to optimize it. Creating good benchmarks is very much a science, but also a bit of an art. Here are some suggestions:

  • Benchmark against large inputs, large enough that they don't comfortably fit in your CPU's cache.
  • Build those inputs from a varied corpus of snippets, rather than one chunk repeated hundreds of times.

This kind of setup exposes more bottlenecks:

Your entire file won't fit in CPU cache16, which surfaces issues with memory locality.

Modern branch predictors are a little too good at what they do, which can be a nightmare for benchmarking. It's possible that a CPU can near-perfectly learn a small file's structure, resulting in far fewer branch misses, which is not realistic! A typical source file won't consist entirely of the same chunk duplicated hundreds of times, so a corpus consisting of a variety of snippets produces a more realistic workload.

For implementing your benchmarks, I highly recommend divan.

Hint: Improving performance

One really common fix for the excessive boxing is to allocate memory in an arena, using a library like bumpalo. That means adding a 'bump lifetime to every single AST node.

If you also add a 'src lifetime, then the AST can hold references into the input source code &str, and you wouldn't have to allocate at all for identifiers and strings.
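
As a rough illustration (a sketch, not the article's code; the node shapes are simplified):

use bumpalo::Bump;

enum Expr<'bump, 'src> {
    Binary(&'bump ExprBinary<'bump, 'src>),
    Ident(&'src str), // borrows straight from the source text
    // ...
}

struct ExprBinary<'bump, 'src> {
    lhs: Expr<'bump, 'src>,
    op: BinaryOp,
    rhs: Expr<'bump, 'src>,
}

enum BinaryOp {
    Add,
    // ...
}

fn example<'bump, 'src>(bump: &'bump Bump, src: &'src str) -> Expr<'bump, 'src> {
    // One arena allocation per node, all freed at once when the Bump is dropped.
    // (Pretend `src` is "a + b".)
    Expr::Binary(bump.alloc(ExprBinary {
        lhs: Expr::Ident(&src[0..1]),
        op: BinaryOp::Add,
        rhs: Expr::Ident(&src[4..5]),
    }))
}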


Put what you learnt into practice!

Try to add new syntax to simp: Where would it fit into the grammar? How would you parse it? What are its edge cases?

Maybe you don't like simp all that much; that's fine! Designing your own language and its syntax from scratch can be really fun. Take inspiration from other languages, or come up with something novel and strange. Read the source code of various open-source parsers to learn their tricks, or brute-force your way to a solution.

Playground

You can try out the parser we built right here in the embedded playground.

In case you'd like a bigger editor, it's also hosted as a separate page.

All the code is available on GitHub.


1. You can also parse straight into executable code.

2. Concrete syntax trees are often used in language servers.

3. Any similarity in name to existing programming languages is purely coincidental, I looked it up and didn't find anything.

4. We're going all in on expressions with no return keyword! This complicates the parser a bit, but also makes it more interesting.

5. Our Span type is like Range<u32>, but it implements Copy. The standard library can't break the Range type, so we're stuck with making our own. It doesn't have first-class syntax, but at least it can still be used to directly index into strings.

6. We could also read one Rust char at a time. We won't allow unicode outside of strings, and spans are easier to produce when dealing with bytes, so we're not doing that.

7. It's possible to write a parser that doesn't stop at anything, and treats errors as part of the syntax tree. Another technique common in language servers!

8. Attempts have been made to standardise it, but nothing seems to have stuck. Different tools accept different syntaxes, and research papers are similarly inconsistent.

9. I cannot recommend this book enough. The web version was my entrypoint into programming languages.

10. An identifier is a name given to some construct, like a function or variable.

11. A string is a sequence of characters, in our case wrapped in double quotes. They're used to store arbitrary text directly in source code.

12. Meaning there is no return keyword which tells the compiler that you want a specific value to be the return value.

13. These may appear in a few places, such as function calls (f(a, b, c,)) and parameter lists (fn f(a, b, c,) {}).

14. Until we hit a stack overflow. There are ways to solve that. We could manually allocate our own stack before entering the parser, and grow it when we are about to run out of space.

15. I counted 23866 .rs files in the tests directory of rustc as of 425a9c0. I don't fully understand the testing architecture, but it's safe to say there are tens of thousands of tests.

16. For most processors, anyway. Some AMD chips now have that much cache, which is kind of terrifying.