Rust Browser Part 2: Parsers

In the last part I talked about my motivations for building a new web browser / rendering engine in Rust. Today I'll tackle how I parse HTML and CSS.

What is Valid?

Parsing file formats like HTML and CSS are tricky. Well, parsing anything is tricky, but HTML and CSS are especially bad. This is because browser expect pages to render even if the markup is invalid. HTML and CSS are supposed to be very forgiving formats. This makes building a parser hard because you can't just go by the spec. My first decision is how much should I support broken pages? Every bug I willingly parse will make the code more complex and longer.

Since this isn't going to be a professional browser I decided not to support completely broken pages. The HTML and CSS must be well formed. I won't care if the values in the files are valid though. For example, someone could make a page with a 'foo' element and my parser will handle it. In fact, it makes life easier if I don't care about such errors. But the page must be valid. Every element must be closed (except for the few that HTML says should not be closed) and every CSS property rule must be written correctly with braces and semicolons. If a file isn't well formed then my parser will either crash or only parse part of the document. That's okay for my purposes.

Parser Toolkits

There are generally three ways to write a parser. First you can do it entirely by hand, one byte at a type, with lots of nested loops and state tracking. This kind of parser is extremely annoying to write, but if you want to implement the full HTML parsing algorithm it's your only option. However, since I don't care about handling every broken page on the web I don't need to do that. Plus I'm lazy, so let's move onto option two.

The second way is to use a grammar file, usually in BNF format. Then you can use parser generation tools like YACC and LEX which will produce the parser code for you. Every time you want to add a new feature you can update the grammar file then rebuild the parser. This is definitely a popular option and there are tons of tools to do this, however I don't like it for two reasons. First, you have to write in another language syntax than your host language. Second, you have to deal with this big heavyweight library which may have bugs in it. I want as few dependencies as possible, and I want to stay in my host language (Rust) as much as possible. If I want to mix in something that the grammar tools don't support, like regexes or lists of constants (which are common in CSS) then I have to add another phase after the gammar parser stage. These tools are good for small things but I have found they don't scale as your project gets larger.

So that brings me to option three. Parser Expression Grammars, or PEGs. Instead of writing rules in a DSL (domain specific language), you write functions in your host language. You build up the grammar with simple pieces like 'is_alpha_numeric' and 'zero_or_one'. The functions are passed other functions for smaller bits, so eventually you have your full grammar. Doing it this way is sort of like the grammar file way above, but everything is in your host language. It also means you can write lots of tiny unit tests as you go along. It may take longer than the previous option but you end up with really solid code that you understand, and lots of tests to tell you when you've broken something.

A full explanation of how PEGs work is beyond the scope of this blog, but I'll give a quick example here. Let's say you want to parse an integer. An integer is made up of one or more digits, and a digit is a single character between 0 and 9. Let's make a simple parser for that in a fake language.

let digit = one_of("0123456789");
let integer = zero_or_more(digit);

Now we can write a unit test for integers that looks something like this:

assert_eq(parse(integer,"42"),42);
assert_eq(parse(integer,"643"),643);

Now suppose we want to support floating point numbers. That looks like an integer followed by a decimal point, followed by more integer digits. So let's define that.

let float = sequence(integer, ".", integer);
assert_eq(parse(float,"42.42"),42.42);
assert_eq(parse(float,"643.88"),643.88);

The crucial point here is that we build the definition of float by reusing integer, and we can still run our old integer tests as well as the new one. A complete language parser is built up out of these tiny parts.

Using POM

Of course this is a fake language and I didn't write the implementation of one_of, zero_or_more, or sequence. It is possible to write those by hand, but it's also a common task so there are libraries which can handle that for us. Since I'm using Rust for this project I chose a crate called pom, which uses operator overloading to make the rule definition easier and more compact. Let's see what the integer parser would look like using pom.

fn integer_string<'a>() -> Parser<'a, u8, String> {
    let rule = one_of(b"01234567").repeat(0..);
    rule.convert(|s|String::from_utf8(s))
}
fn integer<'a>() -> Parser<'a, u8, i32> {
    let rule = integer_string();
    rule.convert(|s|i32::from_str(&s))
}
#[test]
fn test_integer() {
    assert_eq!(integer().parse(b"42"), Ok(42));
}

Okay, not quite as clean, but still pretty understandable. An integer is one of the set of digits, repeated one or more times, with some calls to convert formats. Now I can build float using the integer_string function:

fn float<'a>() -> Parser<'a, u8, f32> {
    let rule = integer_string() - sym(b'.') + integer_string();
    rule.convert(|(i1,i2)| f32::from_str(&format!("{}.{}",i1,i2)))
}
#[test]
fn test_float() {
    assert_eq!(float().parse(b"42.42"), Ok(42.42));
}

Notice in the code for float that uses the rule integer_string() - sym(b'.') + integer_string(). The + is overloaded to mean 'include' and the - to mean 'exclude'. All three parts are used for the parsing but only the integer_string() calls will be returned from the rule as a tuple. The second line converts the tuple into a string, then into an f32 (Rust's floating point type).

Using POM is more verbose than if I wrote this parser in a dynamic language than Javascript, but it is *solid* typesafe code. And with the unit tests I will immediately know if I broke something as I add new features.

Building the full parser

It took me several days to build the full parsers for CSS and HTML, but I didn't try to do it all at once. Take a look at the code here. It's at least 50% unit tests. Unit tests are *crucial* for building parsers. These things are deep and complex so you need all the mental help you can get. Unit tests give you that mental help.

Once I got my POM parser up and running I had to start thinking about what data structure the CSS will be parsed into. I started with what Matt Brubeck recommended in his Rust browser series six years ago (though he build a traditional loop parser instead of PEGs). The structure of the code mostly mirrors the actual parser, so each rule can produce the appropriate enum or struct, then they get assembled together.

CSS is well defined in the W3C specs, but I didn't implement it all at once. It's just too big. I started with simple rules using a single selector and a list of key/value pair declarations. Here's the definition of a Rule and the parts that make it up.

pub struct Rule {
    pub selectors: Vec<Selector>,
    pub declarations: Vec<Declaration>,
}
#[derive(Debug, PartialEq)]

pub enum Selector {
    Simple(SimpleSelector)
}

#[derive(Debug, PartialEq)]
pub struct SimpleSelector {
    pub tag_name: Option<String>,
    pub id: Option<String>,
    pub class: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct Declaration {
    pub(crate) name: String,
    pub(crate) value: Value,
}

#[derive(Debug, PartialEq, Clone)]
pub enum Value {
    Keyword(String),
    Length(f32, Unit),
    ColorValue(Color),
    HexColor(String),
    ArrayValue(Vec<Value>),
    FunCall(FunCallValue),
    StringLiteral(String),
    Number(f32),
}

The only tricky bit is the Value enum. It turns out there are *many* different kinds of values in CSS, so I tried to make an enum that captures all of them, or at least the ones I'm likely to need. I haven't tackled the crazier parts like css-grid templates and media queries. This is enough to get a basic stylesheet working, however.

HTML

For the HTML parser I used the same system as CSS: a POM parser that generates a custom data structure. In this case the structure is similar to the actual DOM web API.

#[derive(Debug, PartialEq)]
pub struct Node {
    pub node_type: NodeType,
    pub children: Vec<Node>,
}

#[derive(Debug, PartialEq)]
pub enum NodeType {
    Text(String),
    Element(ElementData),
}

type AttrMap = HashMap<String, String>;
#[derive(Debug, PartialEq)]
pub struct ElementData {
    pub tag_name: String,
    pub attributes: AttrMap,
}

Some of the HTML rules get more complicated because DOM nodes can be nested. If I tried to create a rule like

element = open_tag + element.repeat(0..) + close

I would get into an infinite loop because element includes itself. To solve this I broke it up into two parts: element and element_child

fn element_child<'a>() -> Parser<'a, u8, Node> {
    meta_tag() | text_content() | selfclosed_element() | standalone_element() | element()
}
fn element<'a>() -> Parser<'a, u8, Node> {
    let p
        = open_element()
        - space()
        + call(element_child).repeat(0..)
        - space()
        + close_element();

    p.map(|(((tag_name, attributes), children), _end_name)|{
        Node {
            children,
            node_type: NodeType::Element(ElementData{
                tag_name,
                attributes,
            })
        }
    })
}

These two rule functions are mutually recursive. The rule element calls element_child and element_child calls element. The reason this still works is two fold. First, notice that each part of element_child is separated by a pipe | symbol. This is overloaded to mean or. Only one of the items will be used, with the first as the highest priority. The element function is called as the last option so it has the lowest priorty, meaning the more specific types of content will be used first. The other thing I do is in the element function. Instead of calling element_child directly I'm using the call() function. This still calls to the element_child but it does some magic underneath to avoid infinite loops.

You can see the full HTML parser here. Again, it is full of unit tests. Every time I made a change to the parsers I run the full suite of unit tests before committing my changes back to the repo.

That's it for today. Next time I'll work on ordering and matching the CSS rules. Have a good weekend.

Talk to me about it on Twitter

Posted March 14th, 2020

Tagged: rust browser

Josh On Design