Parsing Expression Grammars

Emacs Lisp provides several tools for parsing and matching text, from regular expressions (see Regular Expressions) to full left-to-right (a.k.a. LL) grammar parsers (see Bovine parser development). Parsing Expression Grammars (PEG) are another approach to text parsing, one that offers more structure and composability than regular expressions, but less complexity than context-free grammars.

A Parsing Expression Grammar (PEG) describes a formal language in terms of a set of rules for recognizing strings in the language. In Emacs, a PEG parser is defined as a list of named rules, each of which matches text patterns and/or contains references to other rules. Parsing is initiated with the function peg-run or the macro peg-parse (see below), and parses text after point in the current buffer, using a given set of rules.

Each rule in a PEG is referred to as a parsing expression (PEX), and can be specified as a literal string, a regexp-like character range or set, a PEG-specific construct resembling an Emacs Lisp function call, a reference to another rule, or a combination of any of these. A grammar is expressed as a tree of rules in which one rule is typically treated as a "root" or "entry-point" rule. For instance:

((number sign digit (* digit))
 (sign   (or "+" "-" ""))
 (digit  [0-9]))

Once defined, grammars can be used to parse text after point in the current buffer, in a number of ways. The peg-parse macro is the simplest:

peg-parse: Match pexs at point.

(peg-parse
  (number sign digit (* digit))
  (sign   (or "+" "-" ""))
  (digit  [0-9]))
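As a minimal sketch of how this might be tried out, the grammar can be run against a temporary buffer. The sketch assumes the peg library is available and loaded with require; the helper name my-peg-demo-match-number is hypothetical and not part of the library.

```elisp
(require 'peg)

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-match-number (text)
  "Insert TEXT into a temporary buffer and match a number at its start.
Return a list of the value of `peg-parse' and the final value of point."
  (with-temp-buffer
    (insert text)
    ;; Parsing starts at point, so move to the beginning first.
    (goto-char (point-min))
    (list (peg-parse
            (number sign digit (* digit))
            (sign   (or "+" "-" ""))
            (digit  [0-9]))
          (point))))

(my-peg-demo-match-number "-42 apples")  ; => (t 4)
```

The grammar contains no actions, so the match yields only t, and point is left just after "-42".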

While this macro is simple, it is also inflexible, as the rules must be written directly into the source code. More flexibility can be gained by using a combination of other functions and macros.

with-peg-rules: Execute body with rules, a list of PEXs, in effect. Within body, parsing is initiated with a call to peg-run.
peg-run: This function accepts a single peg-matcher, which is the result of calling peg (see below) on a named rule, usually the entry-point of a larger grammar. At the end of parsing, one of failure-function or success-function is called, depending on whether the parsing succeeded or not. If success-function is provided, it should be a function that receives as its only argument an anonymous function that runs all the actions collected on the stack during parsing. By default this anonymous function is simply executed. If parsing fails, a function provided as failure-function will be called with a list of PEG expressions that failed during parsing. By default this list is discarded.

The peg-matcher passed to peg-run is produced by a call to peg:

peg: Convert pexs into a single peg-matcher suitable for passing to peg-run.

The peg-parse example above expands to a set of calls to these functions, and could be written in full as:

(with-peg-rules
    ((number sign digit (* digit))
     (sign   (or "+" "-" ""))
     (digit  [0-9]))
  (peg-run (peg number)))
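The optional failure-function and success-function arguments of peg-run can be sketched as follows. This is an illustration under stated assumptions rather than a recommended pattern: the helper name my-peg-demo-run-number and the symbols matched and failed are hypothetical, and the peg library is assumed to be loaded.

```elisp
(require 'peg)

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-run-number (text)
  "Run the `number' grammar at the start of TEXT.
Return (matched . POS) or (failed . POS), where POS is the final point."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules
        ((number sign digit (* digit))
         (sign   (or "+" "-" ""))
         (digit  [0-9]))
      (peg-run (peg number)
               ;; failure-function: receives the list of PEXs that failed.
               (lambda (_failed-pexs)
                 (cons 'failed (point)))
               ;; success-function: receives a function that runs the
               ;; collected actions; call it, then report point.
               (lambda (run-actions)
                 (funcall run-actions)
                 (cons 'matched (point)))))))

(my-peg-demo-run-number "+7x")  ; => (matched . 3)
(my-peg-demo-run-number "x7")   ; => (failed . 1)
```

Whatever the chosen callback returns becomes the return value of peg-run, which is why the helper can report both the outcome and the position.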

This approach allows more explicit control over the "entry-point" of parsing, and allows the combination of rules from different sources. Individual rules can also be defined using a more defun-like syntax, using the macro define-peg-rule:

define-peg-rule: Define name as a PEG rule that accepts args and matches pexs at point.

For instance:

(define-peg-rule digit ()
  [0-9])

Arguments can be supplied to rules by means of the funcall PEX (see PEX Definitions). Another possibility is to define a named set of rules with define-peg-ruleset:

define-peg-ruleset: Define name as an identifier for rules.

(define-peg-ruleset number-grammar
  ;; `digit' here references the definition above.
  (number () sign digit (* digit))
  (sign () (or "+" "-" "")))

Rules and rulesets defined this way can be referred to by name in later calls to peg-run or with-peg-rules. A ruleset is named by placing its symbol in the list of rules passed to with-peg-rules:

(with-peg-rules (number-grammar)
  (peg-run (peg number)))
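A ruleset can also be combined with locally defined rules. The following sketch assumes that with-peg-rules accepts ruleset names as symbols inside its list of rules, alongside ordinary rule definitions. The names my-peg-demo-number-grammar, pair and my-peg-demo-parse-pair are hypothetical.

```elisp
(require 'peg)

;; Hypothetical ruleset; its rules mirror the number grammar used earlier.
(define-peg-ruleset my-peg-demo-number-grammar
  (number () sign digit (* digit))
  (sign   () (or "+" "-" ""))
  (digit  () [0-9]))

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-parse-pair (text)
  "Match \"NUMBER,NUMBER\" at the start of TEXT.
Return the position where parsing stopped, or nil if parsing failed."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    ;; The list of rules mixes a ruleset name with a local rule, `pair',
    ;; which refers to the ruleset's `number' rule.
    (with-peg-rules (my-peg-demo-number-grammar
                     (pair number "," number))
      (when (peg-run (peg pair))
        (point)))))

(my-peg-demo-parse-pair "-1,+22 rest")  ; => 7
(my-peg-demo-parse-pair "1;2")          ; => nil
```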

By default, calls to peg-run or peg-parse produce no output: parsing simply moves point. In order to return or otherwise act upon parsed strings, rules can include actions (see Parsing Actions).

PEX Definitions

Parsing expressions can be defined using the following syntax:

(and E1 E2...): A sequence of PEXs that must all be matched. The and form is optional and implicit.
(or E1 E2...): Prioritized choices, meaning that, as in Elisp, the choices are tried in order, and the first successful match is used. Note that this is distinct from context-free grammars, in which selection between multiple matches is indeterminate.
(any): Matches any single character, as the regexp ".".
STRING: A literal string.
(char C): A single character c, as an Elisp character literal.
(* E): Zero or more instances of expression e, as the regexp *. Matching is always "greedy".
(+ E): One or more instances of expression e, as the regexp +. Matching is always "greedy".
(opt E): Zero or one instance of expression e, as the regexp ?.
SYMBOL: A symbol representing a previously-defined PEG rule.
(range CH1 CH2): The character range between ch1 and ch2, as the regexp [CH1-CH2].
[CH1-CH2 "+*" ?x]: A character set, which can include ranges, character literals, or strings of characters.
[ascii cntrl]: A list of named character classes.
(syntax-class NAME): A single syntax class.
(funcall E ARGS...): Call PEX e (previously defined with define-peg-rule) with arguments args.
(null): The empty string.

The following expressions are used as anchors or tests – they do not move point, but return a boolean value which can be used to constrain matches as a way of controlling the parsing process (see Writing PEG Rules).

(bob): Beginning of buffer.
(eob): End of buffer.
(bol): Beginning of line.
(eol): End of line.
(bow): Beginning of word.
(eow): End of word.
(bos): Beginning of symbol.
(eos): End of symbol.
(if E): Returns non-nil if parsing PEX e from point succeeds (point is not moved).
(not E): Returns non-nil if parsing PEX e from point fails (point is not moved).
(guard EXP): Treats the value of the Lisp expression exp as a boolean.

Character-class matching can refer to the classes named in peg-char-classes, equivalent to character classes in regular expressions (see Character Classes).
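Several of the expressions above can be combined in a single rule. The following sketch matches a decimal number that is not immediately followed by a letter, using opt, or, and, +, a character range, a named character class and the not predicate. The rule name decimal and the helper name my-peg-demo-decimal-end are hypothetical.

```elisp
(require 'peg)

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-decimal-end (text)
  "Match a decimal number at the start of TEXT.
Return the position just after the number, or nil if there is no match."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules
        ((decimal (opt (or "+" "-"))          ; optional sign
                  (+ [0-9])                   ; integer part
                  (opt (and "." (+ [0-9])))   ; optional fraction
                  (not [alpha])))             ; test only; point stays put
      (when (peg-run (peg decimal))
        (point)))))

(my-peg-demo-decimal-end "3.14 rad")  ; => 5
(my-peg-demo-decimal-end "-7")        ; => 3
(my-peg-demo-decimal-end "12abc")     ; => nil
```

In the last call the digits are matched, but the not predicate fails on the following letter, so the rule as a whole fails.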

Parsing Actions

By default the process of parsing simply moves point in the current buffer, ultimately returning t if the parsing succeeds, and nil if it doesn't. It's also possible to define parsing actions that can run arbitrary Elisp at certain points in the parsed text. These actions can optionally affect something called the parsing stack, which is a list of values returned by the parsing process. These actions only run (and only return values) if the parsing process ultimately succeeds; if it fails the action code is not run at all.

Actions can be added anywhere in the definition of a rule. They are distinguished from parsing expressions by an initial backquote (`), followed by a parenthetical form that must contain a pair of hyphens (--) somewhere within it. Symbols to the left of the hyphens are bound to values popped from the stack (they are somewhat analogous to the argument list of a lambda form). Values produced by code to the right of the hyphens are pushed onto the stack (analogous to the return value of the lambda). For instance, the previous grammar can be augmented with actions to return the parsed number as an actual integer:

(with-peg-rules ((number sign digit (* digit
                                       `(a b -- (+ (* a 10) b)))
                         `(sign val -- (* sign val)))
                 (sign (or (and "+" `(-- 1))
                           (and "-" `(-- -1))
                           (and ""  `(-- 1))))
                 (digit [0-9] `(-- (- (char-before) ?0))))
  (peg-run (peg number)))
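To see what the stack looks like at the end of parsing, the grammar can be wrapped in a small helper and run on sample text. This is a sketch only; the helper name my-peg-demo-parse-integer is hypothetical, and the peg library is assumed to be loaded.

```elisp
(require 'peg)

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-parse-integer (text)
  "Parse an integer at the start of TEXT using actions.
Return the parsing stack on success, or nil on failure."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules ((number sign digit (* digit
                                           ;; Fold each new digit into
                                           ;; the running value.
                                           `(a b -- (+ (* a 10) b)))
                             ;; Apply the sign pushed by the `sign' rule.
                             `(sign val -- (* sign val)))
                     (sign (or (and "+" `(-- 1))
                               (and "-" `(-- -1))
                               (and ""  `(-- 1))))
                     ;; Push the numeric value of the digit just matched.
                     (digit [0-9] `(-- (- (char-before) ?0))))
      (peg-run (peg number)))))

(my-peg-demo-parse-integer "-123")     ; => (-123)
(my-peg-demo-parse-integer "42 cats")  ; => (42)
(my-peg-demo-parse-integer "cats")     ; => nil
```

The result is a list because peg-run returns the whole stack, which here holds a single integer.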

There must be values on the stack before they can be popped and returned: if there aren't enough stack values to bind to an action's left-hand terms, they will be bound to nil. An action with only right-hand terms will push values to the stack; an action with only left-hand terms will consume (and discard) values from the stack. At the end of parsing, stack values are returned as a flat list.

To return the string matched by a PEX (instead of simply moving point over it), a grammar can use a rule like this:

(one-word
  `(-- (point))
  (+ [word])
  `(start -- (buffer-substring start (point))))

The first action above pushes the initial value of point to the stack. The intervening PEX moves point over the next word. The second action pops the previous value from the stack (binding it to the variable start), then uses that value to extract a substring from the buffer and push it to the stack. This pattern is so common that PEG provides a shorthand function that does exactly the above, along with a few other shorthands for common scenarios:

(substring E): Match PEX e and push the matched string onto the stack.
(region E): Match e and push the start and end positions of the matched region onto the stack.
(replace E REPLACEMENT): Match e and replace the matched region with the string replacement.
(list E): Match e, collect all values produced by e (and its sub-expressions) into a list, and push that list onto the stack. Stack values are typically returned as a flat list; this is a way of "grouping" values together.
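The substring and list shorthands can be combined, for example to collect comma-separated words into one list. This is a minimal sketch; the rule names words and word and the helper name my-peg-demo-split-words are hypothetical.

```elisp
(require 'peg)

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-split-words (text)
  "Parse comma-separated alphabetic words at the start of TEXT.
Return the parsing stack on success, or nil on failure."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules
        ;; `list' groups every value pushed while matching its body.
        ((words (list word (* "," word)))
         ;; `substring' pushes the text matched by its body.
         (word (substring (+ [alpha]))))
      (peg-run (peg words)))))

(my-peg-demo-split-words "red,green,blue")  ; => (("red" "green" "blue"))
```

The stack holds a single element, the list built by the list shorthand, so the words arrive grouped in one value.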

Writing PEG Rules

Something to be aware of when writing PEG rules is that they are greedy. Rules which can consume a variable amount of text will always consume the maximum amount possible, even if that causes a rule that might otherwise have matched to fail later on – there is no backtracking. For instance, this rule will never succeed:

(forest (+ "tree" (* [blank])) "tree" (eol))

The PEX (+ "tree" (* [blank])) will consume all the repetitions of the word tree, leaving none to match the final tree. In these situations, the desired result can be obtained by using predicates and guards – namely the not, if and guard expressions – to constrain behavior. For instance, placing (not (eol)) inside the repetition makes the iteration that would consume the last tree fail, which leaves that tree for the final "tree" to match:

(forest (+ "tree" (* [blank]) (not (eol))) "tree" (eol))
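The effect of a predicate on a greedy repetition can be checked with a sketch like the following. It contrasts the unguarded rule with a variant in which (not (eol)) sits inside the repetition. Both helper names are hypothetical, and the peg library is assumed to be loaded.

```elisp
(require 'peg)

;; Hypothetical helper: the unguarded rule.
(defun my-peg-demo-greedy-forest (text)
  "Run the unguarded `forest' rule at the start of TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules
        ((forest (+ "tree" (* [blank])) "tree" (eol)))
      (peg-run (peg forest)))))

;; Hypothetical helper: the same rule with a predicate in the loop.
(defun my-peg-demo-guarded-forest (text)
  "Run `forest' with (not (eol)) inside the repetition."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules
        ;; An iteration that ends at end of line fails, and the
        ;; repetition stops before the last "tree".
        ((forest (+ "tree" (* [blank]) (not (eol))) "tree" (eol)))
      (peg-run (peg forest)))))

(my-peg-demo-greedy-forest "tree tree tree")   ; => nil
(my-peg-demo-guarded-forest "tree tree tree")  ; => t
```

In the guarded variant, a failed iteration restores point to where that iteration began, which is what leaves the last word available to the final "tree".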

The if and not operators accept a parsing expression and interpret it as a boolean, without moving point. The contents of a guard operator are evaluated as regular Lisp (not a PEX) and should return a boolean value. A nil value causes the match to fail.

Another potentially unexpected behavior is that parsing will move point as far as possible, even if the parsing ultimately fails. This rule:

(end-game "game" (eob))

when run in a buffer containing the text "game over" after point, will move point to just after "game" then halt parsing, returning nil. Successful parsing will always return t, or the contents of the parsing stack.
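This behavior can be observed by reporting point alongside the return value of peg-run. The following is a sketch only; the helper name my-peg-demo-end-game is hypothetical.

```elisp
(require 'peg)

;; Hypothetical helper, for illustration only.
(defun my-peg-demo-end-game (text)
  "Run the `end-game' rule at the start of TEXT.
Return a list of the value of `peg-run' and the final value of point."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (with-peg-rules ((end-game "game" (eob)))
      (list (peg-run (peg end-game))
            (point)))))

(my-peg-demo-end-game "game over")  ; => (nil 5)
(my-peg-demo-end-game "game")       ; => (t 5)
```

Code that must leave point untouched on failure can record point before calling peg-run and return to it with goto-char when the result is nil.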