Parser Combinators: a Walkthrough

Or: Write you a Parsec for Great Good

Most people’s journey with Haskell will follow the same first steps. First, you need to get acquainted with the language itself, especially if you arrive at it without any knowledge of functional programming: its unusual syntax, laziness, the lack of mutable state… From there, you can move to functions and type declarations, familiarize yourself with the Prelude functions (and their known warts), and start writing your first programs. The next big step is, of course, monads, the infamous m-word^[1], and how we use them to structure our programs.

But then, when you’ve started to develop an intuition for monads… the path becomes unclear. There’s a lot more that could be covered, but none of it is as fundamental. There’s roughly three categories of topics one might want to explore:

Going further with the language itself: learning about language extensions, the theory behind them, and the advanced features they provide.
Learning more about the different ways to structure a program (such as monad transformers or effect systems).
Exploring some of the most used libraries, and understanding what makes them so ubiquitous: QuickCheck, Lens… and Parsec.

Today, I want to explore Parsec, and most specifically how Parsec works. Parsing is ubiquitous, and most Haskell programs will use Parsec or one of its variants (megaparsec or attoparsec). While it’s possible to use those libraries without caring about how they work, I think it’s fascinating to develop a mental model for their inner design; namely, monadic parser combinators, as it is a simple and elegant technique that is useful beyond the use case of parsing raw text. At Hasura, we have recently used this technique to rewrite the code that generates a GraphQL schema and checks an incoming GraphQL query against it.

To develop an intuition for this technique, in the course of this article, we will first reimplement a simplified version of Parsec, and then use it to write a parser for JSON values^[2]. Our goal here won’t be to develop a fully-fledged library, quite the opposite: we will implement the strict minimum that we need, to focus on the core ideas.

Choosing a type representation for parsers

From a high level perspective, a parser can be thought of as a translation function: it takes as input some loosely-structured data (most of the time: text), and attempts to translate it into structured data, following the rules of a formal grammar. A compiler will transform a series of characters into an abstract syntax tree, a JSON parser will transform a series of characters into an equivalent Haskell representation of the JSON value. There are several algorithmic approaches to this problem; parser combinators are an example of recursive descent: we parse each term of our grammar by recursively calling the parsers for each sub-term.

What that means is that, by “parser”, we both mean the overall high level transformation and each of its individual steps, as each parser is expressed recursively as a combination of other parsers.

A minimal viable implementation

Let’s build up an understanding of the type of a parser step by step. First, we have established that a parser is a function, from some given input to whatever it is that we’re trying to parse. Throughout this project, we will be using String as input, for simplicity^[3]. We could therefore decide to use the following to represent our parsers:

type Parser a = String -> a

But this type isn’t enough for our use case. First and foremost: a parser can fail, if the input doesn’t match what our grammar dictates. We will need to define an appropriate type to represent errors, and a parser needs to include this possibility at the type level.

type Parser a = String -> Either ParseError a

Furthermore, parsing is a sequential operation: to parse a JSON object, one must first parse an opening curly bracket, then parse entries, then parse a closing curly bracket; recursively, to parse an entry, one must first parse a key, then a colon, then a value. Our parser type will be used to represent each of those steps, and we don’t expect each step to fully consume the input string: each parser must therefore also return something that indicates where we are in the input stream, how much of it was consumed, or what’s left to process.

Since we’re using String as our input stream, it is enough for each parser to simply return what’s left of the input string^[4]. The Parser type that we will use is thus:

newtype Parser a = Parser {
  runParser :: String -> (String, Either ParseError a)
}

Writing elementary parsers

As is common in Haskell, we start by covering the “axiomatic” base cases, and we then generalize / extrapolate from there. Given that our stream is a String, and our individual tokens are just the characters of said string, our two base cases simply correspond to the two constructors of a list: there are tokens left in the input stream, or we’ve reached the end. We call those two functions any and eof, respectively:

any :: Parser Char
any = Parser $ \input -> case input of
  -- some input left: we unpack and return the first character
  (x:xs) -> (xs, Right x)
  -- no input left: the parser fails
  []     -> ("", Left $ ParseError
    "any character"        -- expected
    "the end of the input" -- encountered
  )

eof :: Parser ()
eof = Parser $ \input -> case input of
  -- no input left: the parser succeeds
  []    -> ("", Right ())
  -- leftover data: the parser fails
  (c:_) -> (input, Left $ ParseError
    "the end of the input" -- expected
    [c]                    -- encountered
  )

Those are the only two basic parsers that we need! They cover the two base cases: we have a character to parse, or none. Everything else can be expressed in terms of those two, and by using combinators.

Sequencing parsers

As mentioned earlier, parsing is sequential. We will often need to express something along the lines of: I want to apply some parser A, then some parser B, and use their results. Let’s look at how we would implement a parser for a JSON object entry, for instance: we need to parse a json string, then a colon, then a json value (we will assume those individual parsers already exist). The resulting code is… suboptimal: repetitive, error-prone, and hard to read:

jsonEntry :: Parser (String, JValue)
jsonEntry = Parser $ \input ->
  -- parse a json string
  case runParser jsonString input of
    (input2, Left err)  -> (input2, Left err)
    (input2, Right key) ->
      -- on success: parse a single colon
      case runParser (char ':') input2 of
        (input3, Left err) -> (input3, Left err)
        (input3, Right _)  ->
          -- on success: parse a json value
          case runParser jsonValue input3 of
            (input4, Left err)    -> (input4, Left err)
            (input4, Right value) ->
              -- on success: return the result
              (input4, Right (key, value))

This is made quite cumbersome by both the fact that each step might fail, and by the fact that each step returns the input for the next one. Furthermore, it forces us to know about the internal structure of a parser to know how to chain them. To make things easier, we can extract that pattern into a standalone function:

andThen :: Parser a -> (a -> Parser b) -> Parser b
parserA `andThen` f = Parser $ \input ->
  case runParser parserA input of
    (restOfInput, Right a) -> runParser (f a) restOfInput
    (restOfInput, Left  e) -> (restOfInput, Left e)

This function allows us to rewrite our JSON entry parser in a way that no longer requires explicitly introspecting each parser along the way.

jsonEntry :: Parser (String, JValue)
jsonEntry =
  -- parse a json string
  jsonString `andThen` \key ->
    -- parse a single colon
    char ‘:’ `andThen` \_ ->
      -- parse a JSON value
      jsonValue `andThen` \value ->
        -- create a constant parser that consumes no input
        constParser (key, value)

By the power of monads!

To some readers, this andThen pattern will feel familiar; that’s because andThen is exactly the same as Monad’s bind operator:

(>>=) :: Parser a -> (a -> Parser b) -> Parser b

By making our Parser type an instance of Monad, we can leverage the power of all the functions that come with it and the much more comfortable syntax of do notation for our parsers^[5]. At last, our jsonEntry function can be rewritten in a concise and straightforward manner:

jsonEntry :: Parser (String, JValue)
jsonEntry = do
  key   <- jsonString
  _     <- char ‘:’
  value <- jsonValue
  return (key, value)

As a sidenote, it is worth mentioning that while do notation makes the sequencing explicit, we will also make use of operators throughout the code (this wouldn’t be “true Haskell” otherwise :P). Most notably, the following Functor and Applicative operators:

(*>) :: Parser a -> Parser b -> Parser b
(<*) :: Parser a -> Parser b -> Parser a
(<$) :: a -> Parser b -> Parser a

-- ignore leading spaces
spaces *> value

-- ignore trailing spaces
value <* spaces

-- substitute a value
True <$ string “true”

Recap

We have so far defined our two elementary parsers, and seen how chaining is a monadic operation. The stage is set: our basic needs are covered. We can now start expressing more interesting parsers. To start with, let’s introduce satisfy: a small wrapper around any that allows us to check whether the next character in the input meets some arbitrary requirement.

satisfy :: String -> (Char -> Bool) -> Parser Char
satisfy description predicate = try $ do
  c <- any
  if predicate c
    then return c
    else parseError description [c]

This function, while straightforward, uses other functions we haven’t encountered yet, such as try. It’s our first combinator!

Combining parsers

Formally, a combinator is a function that does not rely on anything but its arguments, such as (.) the function composition operator:

(f . g) x = f (g x)

But informally… a combinator is something that combines other things. And it is in that informal sense that we use it here! In our context, parser combinators are functions on parsers: functions that combine and transform parsers into other parsers, to handle such things as backtracking, or repetition.

Choice, errors, and backtracking

We know that parsers can fail: that’s why we have the Either in their type after all. But not all errors are fatal: it is possible to recover from some of them. Imagine, for instance, that you are trying to parse a JSON value: it could be an object, or an array, or a string… To parse such a value, you could try to parse the opening curly braces of an object: if that succeeds, you can continue with parsing an object; but if that fails immediately, you can instead try to parse the opening bracket of an array, and so on. It is with error and backtracking that we can implement our first “advanced” feature: choice.

To distinguish between a real error that happened further down the line or something that simply was the wrong choice, we make a distinction between parsers that fail without consuming any input, that fail immediately, and parsers that fail later on: we assume that if any input was consumed, then we were on the right “branch”. But looking only one character ahead will not always be enough to decide whether a branch is correct, and in that case we need to be able to backtrack should a branch fail: we need a backtracking combinator. This is what try does: it transforms a parser into a backtracking one: one that consumes no input on failure, that restores the state to what it was. It makes sense to use it for satisfy: if the predicate fails, we haven’t encountered a single character we could unpack, and we should leave the input string unmodified.

try :: Parser a -> Parser a
try p = Parser $ \state -> case runParser p state of
  (_newState, Left err) -> (state, Left err)
  success               -> success

Throughout this project, we name our combinators the exact same way Parsec does. The operator that Parsec uses to represent choice is the same as the one defined by the Alternative typeclass: (<|>). Its semantics are simple: if the first parser failed without consuming any input^[6], then try the second one; otherwise, propagate the error:

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser $ \s -> case runParser p1 s of
  (s', Left err)
    | s' == s   -> runParser p2 s
    | otherwise -> (s', Left err)
  success -> success

With this operator, we can finally implement a much more convenient combinator: choice. Given a list of parsers, try them all until one succeeds.

choice :: String -> [Parser a] -> Parser a
choice description = foldr (<|>) noMatch
  where noMatch = parseError description "no match"

Repetition

One last group of combinators we will need for our project is repetition combinators. Those do not require any internal knowledge of our Parser type, and are a good example of what kind of high-level abstraction we can now write. As usual, we use the same names as Parsec: many is the equivalent of a regex’s star and matches zero or more occurrences of the given parser, while many1 is the equivalent of plus: it matches one or more occurrences:

many, many1 :: Parser a -> Parser [a]
many  p = many1 p <|> return []
many1 p = do
  first <- p
  rest  <- many p
  return (first:rest)

Thanks to those, we can also implement sepBy and sepBy1, which match repeated occurrences of a given parser with a given separator between them:

sepBy, sepBy1 :: Parser a -> Parser s -> Parser [a]
sepBy  p s = sepBy1 p s <|> return []
sepBy1 p s = do
  first <- p
  rest  <- many (s >> p)
  return (first:rest)

Recap

That’s it: those seven combinators are all we need to implement a JSON parser from scratch; at this point, we already have a good enough minimal reimplementation of a Parsec-like parser combinators library!

Of course, a fully-fledged library would implement much more: more primitives for handling characters, combinators for optional parsers, more specific repetition combinators, support for better error messages… But this is beyond the scope of this exercise.

Parsers in practice: parsing JSON

To put together everything we’ve seen so far, let’s walk through the process of building the grammar of a JSON parser, step by step. We will do it from the bottom up: starting with parsers for individual characters, then moving on to parsers for syntax, then for scalars… until we can finally express a parser for an arbitrary JSON value. The goal here is to showcase how at each step we can leverage the simpler abstractions we have built, in order to compose something more complex. This section is a bit more code-heavy, but it is my hope that, if you’ve followed everything so far, you will find it as pleasant to read as I do!

We will use the following representation for JSON values, which is very close to what Aeson^[7] defines:

data JValue
  = JObject (HashMap String JValue)
  | JArray  [JValue]
  | JString String
  | JNumber Double
  | JBool   Bool
  | JNull

Recognizing characters

Using the functions of Data.Char, let’s start defining our grammar by defining the kind of characters that we want to recognize: this is a straightforward use of our satisfy function:

char c = satisfy [c]     (== c)
space  = satisfy "space" isSpace
digit  = satisfy "digit" isDigit

Language syntax

To go further, we can define convenient syntax functions, using the aforementioned Applicative operators:

string = traverse char
spaces = many space
symbol s = string s <* spaces

between open close value = open *> value <* close
brackets = between (symbol "[") (symbol "]")
braces   = between (symbol "{") (symbol "}")

symbol’s implementation here is based on Parsec’s definition of a lexeme (in their language definition library), which always skips trailing whitespace; every parser can therefore be safely written with the assumption that there is no leading whitespace it needs to take into account. Thanks to this approach, there are very few explicit mentions of whitespace in our JSON grammar.

Scalars

Here is where our implementation will differ from the JSON standard. For the sake of simplicity, our parser for numbers will only match natural numbers:

jsonNumber = read <$> many1 digit

For booleans, we simply match the two possible cases:

jsonBool = choice "JSON boolean"
  [ True  <$ symbol "true"
  , False <$ symbol "false"
  ]

As for strings, we match a sequence of characters between two double quotes, but we have to handle the possibility that some characters might be escaped. We only handle a small subset of escaped characters, implementing the rest of the specification is left as an exercise to the reader. :)

jsonString =
  between (char '"') (char '"') (many jsonChar) <* spaces
  where
    jsonChar = choice "JSON string character"
      [ try $ '\n' <$ string "\\n"
      , try $ '\t' <$ string "\\t"
      , try $ '"'  <$ string "\\\""
      , try $ '\\' <$ string "\\\\"
      , satisfy "not a quote" (/= '"')
      ]

Arrays and objects

JSON values are by definition recursive: arrays and objects contain other JSON values… To continue with our bottom-up approach, we will be assuming the existence of a top-level jsonValue parser, which will be defined last.

A JSON object is a group of individual entries, separated by commas. We first parse them as a list, using our repetition combinator, then transform said association list into a HashMap:

jsonObject = do
  assocList <- braces $ jsonEntry `sepBy` symbol ","
  return $ fromList assocList
  where
    jsonEntry = do
      k <- jsonString
      symbol ":"
      v <- jsonValue
      return (k,v)

Finally, an array is simply a group of values, between brackets, separated by commas:

jsonArray = brackets $ jsonValue `sepBy` symbol ","

Putting it all together

At last, we can express the top-level parser for a JSON value:

jsonValue = choice "a JSON value"
  [ JObject <$> jsonObject
  , JArray  <$> jsonArray
  , JString <$> jsonString
  , JNumber <$> jsonNumber
  , JBool   <$> jsonBool
  , JNull   <$  symbol "null"
  ]

And… that’s it!

Wrapping up

This article is the result of my personal experience trying to understand the inner workings of Parsec, after using it for years. I tried my hand at writing a small JSON parser from scratch, on the side, as I was learning about what makes Parsec tick, to gain a better understanding by putting it in practice. I was astonished, once done, by how small and concise the resulting code was, and by how straightforward the JSON grammar was. It is my sincere hope that this walkthrough will prove useful to you, and will give you an insight into the beauty of parser combinators!

All the code in this article can be found in this GitHub gist.

For a more advanced example of using parser combinators, if you want to read more about how we used this technique to reimplement our GraphQL schema generation, you can have a look at the description of the PR that introduced it.

Hasura is, of course, hiring. If discussions like the above are something you find compelling, do look at our open roles and apply. If you would like to keep an eye on what the team is building, where we are speaking, and an occasional baby animal gif... the Hasura Community Newsletter is a monthly delivery without the traditional marketing spam.

Thank you for reading!

On a personal note: a hill I am ready to die on is that monads are not complicated, but often poorly taught, or taught in a way that is not properly tailored for the audience. I am seriously considering running some kind of “an intuition for monads if you already know at least one other programming language, in less than 15 minutes or your money back” program. ↩︎
For the sake of simplicity, the code in this article will not be fully standard-compliant: we will only handle natural numbers, and a reduced set of escapable characters. ↩︎
Parser combinator libraries do not directly use String as an input; partly because there are better types available to represent text, such as Text, but also because a parser isn’t always the first translation step: often, such as in the case of a compiler, a lexer or tokenizer will have already transformed the input text into a sequence of lexemes or tokens. What matters for a parser is that the input data can be linearly iterated upon; Parsec calls this a Stream: a String is a stream of Char, and the output of a tokenizer would likewise be a stream of tokens. ↩︎
Most libraries choose to pass along a State record that wraps the input streams and carries additional information, such as current position, which allows for (among other benefits) much better error messages. ↩︎
While Parsec and its derivatives all use this monadic approach, it is possible to build parsers that only rely on Applicative. It is a trade-off: monadic parsers are more powerful, as parsing can branch depending on what was parsed before, but applicative parsers being more restricted allow for static introspection of values. ↩︎
Doing string comparison to check whether the parser failed to consume any input is extremely inefficient, and an unfortunate consequence of our simplified design. If instead of simply passing a String along as state we were to use a proper State record, we could implement this in a more performant manner, by comparing a value that represents our position in the input stream for instance. ↩︎
Aeson is the most widely used Haskell library for JSON manipulation. ↩︎