Make a Markdown Parser with OhmJS

In many of my projects I need to parse something, usually a text-based language. Sometimes it's inline docs. Sometimes it's a custom DSL for generating code. And sometimes it's Markdown files. My go-to tool for this type of work is OhmJS, an open source parsing toolkit for JavaScript based on PEGs (parsing expression grammars).

OhmJS has a pretty good tutorial and a great online tool for seeing how your parser works, so I won't cover that same territory here. Instead we're going to build a more complex parser as an example: Markdown.

Markdown is a tricky format because it is not well defined. There are many extensions to it and it has some quirky rules. My first few attempts ended in failure. I could parse the first paragraph but syntax inside the text block would interfere with knowing when the block ended and the next one began. On my third attempt I figured out the trick: don't parse it all at once. Instead use two passes. The first pass finds the block boundaries and the second pass dives inside the blocks. With that crucial decision made the rest is easy. Let's take a look.
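
Before getting to the real grammars, here is the two-pass shape in miniature. This is only a sketch in plain JavaScript, with string splitting and a regular expression standing in for Ohm, and the function names (splitBlocks, splitSpans, twoPass) are invented for illustration.

```javascript
// Pass 1: find block boundaries only. In this sketch a blank line separates
// blocks, and whatever syntax sits inside a block is left untouched.
function splitBlocks(text) {
    return text
        .split(/\n\s*\n/)
        .map(chunk => chunk.trim())
        .filter(chunk => chunk.length > 0)
}

// Pass 2: look inside a single block. Only `inline code` spans are
// recognized here; everything else is reported as plain text.
function splitSpans(block) {
    return block
        .split(/(`[^`]*`)/)
        .filter(part => part.length > 0)
        .map(part => part.length >= 2 && part.startsWith('`') && part.endsWith('`')
            ? ['code', part.slice(1, -1)]
            : ['plain', part])
}

// The second pass never sees more than one block at a time.
const twoPass = (text) => splitBlocks(text).map(splitSpans)
```

Because the second pass only ever receives one block, an unclosed backquote in one paragraph cannot leak into the next, which is the failure described above.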

Parsing Blocks

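Here is the complete grammar for the first pass. It is shown the way it sits inside a JavaScript template literal, which is why the backquotes and the newline are escaped: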

MarkdownOuter {
    doc = block+
    block = blank | h3 | h2 | h1 | bullet | code | para | endline
    h3 = "###" rest
    h2 = "##" rest
    h1 = "#" rest
    para = line+ // a paragraph is just multiple consecutive lines
    bullet = "* " rest (~"*" ~blank rest)* // first line, then continuation lines
    code = q rest (~q any)* q // anything between the \`\`\` markers
    q = "\`\`\`" // starts and ends a code block
    nl = "\\n" // newline
    sp = " "
    blank = sp* nl // a blank line has only spaces and a newline
    endline = (~nl any)+ end // last line of the input, with no trailing newline
    line = (~nl any)+ nl // a line has at least one character
    rest = (~nl any)* nl // everything to the end of the line
}

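The document is defined as a series of blocks. A block can be a blank line, one of the headers, a bullet in a list, a code block, a generic paragraph, or a final line with no trailing newline (endline). The headers are listed from h3 down to h1 because Ohm tries alternatives in order, and "#" would otherwise match the start of "##". The headers and the paragraph are pretty simple, but the bullet and code rules need some explanation.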

A code block is delimited with triple backquotes and can contain anything, including newlines and header markers. To handle this I used the negative lookahead operator, ~, which succeeds only when the pattern after it does not match, and consumes no input. Paired with any, it lets me say "take any character, newlines included, as long as the closing ``` doesn't start here". To keep it simple I defined the triple backquotes as q so that I can reuse it in the code pattern: q (~q any)* q. This works perfectly except for one thing: anything after the opening ``` on the same line gets lumped in with the code. In real-world Markdown people often put the name of the code snippet's language on that line, like this:

``` javascript
console.log("i'm some javascript")
```

To handle that I defined another rule called rest, which slurps up everything to the end of the line. If there is any text there, it gets passed to my semantics, which can deal with it properly instead of trying to parse it here. The final rule is the one you see above with rest in it. Bullet list items are handled by a rule built on the same negative lookahead idea: after the first line, the bullet keeps taking lines until it reaches one that starts with "*" or is blank.
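
To make the lookahead concrete, here is roughly what q rest (~q any)* q does, written out by hand in plain JavaScript. It illustrates the matching logic only and is not how Ohm implements it. matchCodeBlock and its return shape are hypothetical, and trimming the language name is an assumption of this sketch.

```javascript
// Three backquotes, built with repeat() so the fence never appears literally.
const FENCE = '`'.repeat(3)

// Returns {language, content, end} if a code block starts at pos, else null.
function matchCodeBlock(input, pos = 0) {
    // q: the opening fence
    if (!input.startsWith(FENCE, pos)) return null
    let i = pos + FENCE.length

    // rest: everything up to and including the newline
    const nl = input.indexOf('\n', i)
    if (nl === -1) return null
    const language = input.slice(i, nl).trim() // assumption: trim the name
    i = nl + 1

    // (~q any)*: before consuming each character, peek for the closing fence
    const start = i
    while (i < input.length && !input.startsWith(FENCE, i)) i++

    // q: the closing fence has to be there, otherwise the whole match fails
    if (!input.startsWith(FENCE, i)) return null
    return { language, content: input.slice(start, i), end: i + FENCE.length }
}
```

The while loop is the interesting part: the peek never consumes anything, so the closing fence is still there for the final check, just as ~q leaves the input untouched for the trailing q.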

Block AST

Once the blocks are parsed you can use semantics to turn them into whatever structure you want. In my case I made a tiny AST with the name of the block and its contents as one big string.

const H1 = (content) => ({type:'H1', content})
const H2 = (content) => ({type:'H2', content})
const H3 = (content) => ({type:'H3', content})
const P = (content) => ({type:'P', content})
const LI = (content) => ({type:'LI', content})
const code = (language, content) => ({type:'CODE', language, content})

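And my semantics look like this:

parser.semantics.addOperation('blocks', {
    // Calling blocks() on a repetition (a * or + node) relies on Ohm mapping
    // the operation over its children; newer Ohm releases may need an
    // explicit _iter action for that.
    _terminal() { return this.sourceString },
    h1: (_, b) => H1(b.blocks()),
    h2: (_, b) => H2(b.blocks()),
    h3: (_, b) => H3(b.blocks()),
    // name is whatever followed the opening fence, e.g. the language
    code: (_, name, cod, _2) => code(name.blocks(), cod.blocks().join("")),
    para: a => P(a.sourceString),
    blank: (a, b) => ({type:'BLANK'}),
    // first line plus any continuation lines
    bullet: (a, b, c) => LI(b.sourceString + c.sourceString),
    // the line's text without its trailing newline
    rest: (a, _) => a.blocks().join("")
})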
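
As an example of what can be done with these nodes, here is a sketch that turns one block into HTML. Nothing here comes from the original code beyond the node shapes: blockToHtml, escapeHtml, and the choice of tags are all assumptions.

```javascript
// Escape the characters that are special in HTML text and attributes.
const escapeHtml = (s) => String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')

// Turn one first-pass node into an HTML string.
// Unknown types (such as BLANK) produce an empty string.
function blockToHtml(block) {
    const text = escapeHtml(String(block.content || '').trim())
    switch (block.type) {
        case 'H1': return `<h1>${text}</h1>`
        case 'H2': return `<h2>${text}</h2>`
        case 'H3': return `<h3>${text}</h3>`
        case 'P':  return `<p>${text}</p>`
        case 'LI': return `<li>${text}</li>`
        case 'CODE': {
            // The language arrives as the raw rest of the fence line.
            const lang = String(block.language || '').trim()
            const cls = lang ? ` class="language-${escapeHtml(lang)}"` : ''
            // Code content is escaped but not trimmed, to keep its layout.
            return `<pre><code${cls}>${escapeHtml(block.content)}</code></pre>`
        }
        default: return ''
    }
}
```

A whole document would then be something like blocks.map(blockToHtml).join("\n"), with the content still treated as one big string at this stage.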

What's on the Inside?

Now we need to process the contents of the blocks. I do that with a second grammar like this:

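MarkdownInner {
    block = para*
    para = link | bold | italic | code | plain
    // plain stops before anything that could start another span; "![" is
    // listed so the "!" of an image isn't swallowed as plain text
    plain = ( ~( "*" | "\`" | "![" | "[" | "__") any)+
    bold = "*" (~"*" any)* "*"
    italic = "__" (~"__" any)* "__"
    code = "\`" (~"\`" any)* "\`"
    link = "!"? "[" (~"]" any)* "]" "(" (~")" any)* ")"
}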

This defines the contents of a block as a list of spans, each of which is a link, bold, italics, inline code, or just plain text. Note that in this grammar a single * marks bold and a double underscore marks italics. Each sub-rule uses a similar pattern with a negative lookahead. Links use the pattern twice because they have two parts, the text and the URL. The operation to process the block contents looks like this:

parser.semantics.addOperation('content', {
    _terminal() { return this.sourceString },
    plain(a) { return ['plain', a.content().join("")] },
    bold(_1, a, _2) { return ['bold', a.content().join("")] },
    italic(_1, a, _2) { return ['italic', a.content().join("")] },
    code: (_1, a, _2) => ['code', a.content().join("")],
    // img is "!" for an image and "" for an ordinary link
    link: (img, _1, text, _2, _3, url, _4) => ['link',
        text.content().join(""),
        url.content().join(""),
        img.content().join("")]
})

Again, it's using a simple AST for runs of style within the block: each span becomes an array whose first element names the kind of span.

That's pretty much it. The rest of the code attaches the two parsers together with a nice async function.
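
A minimal sketch of that glue, assuming the two passes are wrapped as functions: parseBlocks returns the {type, content} nodes from the first grammar and parseSpans returns the span arrays from the second. Both names, and the decisions to drop BLANK nodes and leave CODE content untouched, are assumptions rather than the code in the repo.

```javascript
// Run the second pass over the content of every block from the first pass.
function runBothPasses(text, parseBlocks, parseSpans) {
    return parseBlocks(text)
        // assumption: blank lines carry nothing worth keeping
        .filter(block => block.type !== 'BLANK')
        // assumption: code blocks stay verbatim, everything else gets spans
        .map(block => block.type === 'CODE'
            ? block
            : { ...block, content: parseSpans(block.content) })
}

// Async wrapper, so callers can await it next to file reads or network calls.
// Nothing in this sketch actually needs to wait.
async function parseMarkdown(text, parseBlocks, parseSpans) {
    return runBothPasses(text, parseBlocks, parseSpans)
}
```

Passing the two passes in as arguments keeps the glue independent of how each grammar is loaded, and makes it easy to try with stand-in functions.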

I've submitted the full code to the OhmJS repo here.

Posted July 16th, 2021
