Programming Beyond Text: the Parsing Problem

I’ve written many times about how programming is being held back by storing our code as ASCII text. Those efforts garnered a dim reception. As strong as the arguments for other storage formats may be, text works extremely well with existing tools. Leaving text behind means leaving an entire ecosystem of practice and tooling, so we are stuck at a local maximum.

I’ve pondered this problem for two years, trying to determine the underlying issue with plain text as a storage medium for code. Sure, escape characters are annoying and interfile linking is brittle, but those are fixable with tools. What is the root problem with using plain text for code?

I think I’ve finally cracked the nut. We are, quite simply, parsing at the wrong time.

Syntax Is for Compilers

Fundamentally, a codebase is a symbolic graph. We store it as text using a specific arrangement with specific rules. This is the language’s syntax. Some of the syntax exists for human benefit. For example, infix math notation is easier to read than prefix. Syntax highlighting in modern IDEs also falls into this category: making all string literals green, for example, so the human eye can scan them more easily. But though some syntax is for human benefit, much of it is really there for the compiler, or more specifically the parser.

Consider the semicolon at the end of a statement. Even in older languages like C, the statement terminator is usually unnecessary for the human programmer. It’s obvious where one statement ends and the next begins (at least in well-written code).

The compiler, however, needs this hint to know that the statement has truly ended. In some languages newlines and strict whitespace rules provide this indication. Whatever the mechanism, the compiler needs to build an internal graph structure before it can actually generate machine executable code.
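To make the "internal graph structure" concrete, here is a small sketch using Python’s standard `ast` module. Python is only a convenient stand-in here, and the variable names are made up for illustration. Several different spellings of one statement, including one with a trailing semicolon, all parse to the same tree.

```python
import ast

# Four different spellings of the same statement (names are illustrative).
variants = [
    "total = price + tax",
    "total   =   price+tax",
    "total = (price + tax)",
    "total = price + tax;",
]

# ast.dump leaves out line and column attributes by default,
# so what remains is the structure alone.
trees = [ast.dump(ast.parse(source)) for source in variants]

print(trees[0])
print(all(tree == trees[0] for tree in trees))  # True
```

The spacing, the parentheses, and the semicolon are all gone by the time the parser is done. They existed only to get the text safely into tree form.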

The syntax of the language provides hints to the parser so that it can process the code correctly before handing it on to the rest of the compiler. This split is often referred to as the front end and the back end of the compiler: the parser lives in the front end, and code generation lives in the back end.

Fundamentally, language syntax provides ways to resolve ambiguity. The compiler is just a computer program. It doesn’t know what the human actually meant. Syntax provides ways for the human to tell the computer what they really meant. We could of course design a compiler that would guess what the programmer wanted, but this would result in very unreliable code. Far better to use syntax as a way to resolve ambiguity.

Parse at Edit Time

I have no argument against syntax. It truly is essential. I just think we are doing it at the wrong time. Rather than resolving the ambiguity at compile time, we should be doing it at edit time. If all ambiguity is resolved at edit time, when the human is actually there typing in code, then the compiler would never need syntactic hints. It wouldn’t need a syntax at all, but could instead be fed the underlying graph structure it really wants.
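Here is a rough sketch of what feeding the compiler a graph might look like. Everything in it is an assumption made for illustration: the node layout (`kind`, `op`, `left`, `right`), the use of JSON as the stored form, and the tiny `evaluate` function standing in for a real compiler back end. JSON still has to be decoded, of course, but that decoding knows nothing about the language being stored.

```python
import json
import operator

# Hypothetical stored form: what an editor might save instead of source text.
# It encodes price + (price * 0.5); the nesting carries the precedence.
stored = json.dumps({
    "kind": "binop", "op": "+",
    "left": {"kind": "var", "name": "price"},
    "right": {
        "kind": "binop", "op": "*",
        "left": {"kind": "var", "name": "price"},
        "right": {"kind": "num", "value": 0.5},
    },
})

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(node, env):
    """Walk the graph directly; no tokenizer or grammar is involved."""
    kind = node["kind"]
    if kind == "num":
        return node["value"]
    if kind == "var":
        return env[node["name"]]
    if kind == "binop":
        left = evaluate(node["left"], env)
        right = evaluate(node["right"], env)
        return OPS[node["op"]](left, right)
    raise ValueError(f"unknown node kind: {kind!r}")

graph = json.loads(stored)
result = evaluate(graph, {"price": 10})
print(result)  # 15.0
```

Notice there are no precedence rules and no parentheses anywhere. Whether the multiplication binds tighter than the addition was decided when the graph was built, which in this scheme means while the programmer was typing.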

I am essentially arguing that parsing should move to the editor instead of the compiler. The compiler is no place for disambiguation. The compiler might be invoked by an IDE, or a build process, or on a CI server. These are times when the human isn’t present to provide guidance, which is why the compiler needs syntax. Move parsing to when the human is present and many problems become simpler. Mandatory whitespace vs. flexible indenting, spaces vs. tabs, bracket alignment: these problems simply go away, at least as far as the compiler is concerned. We might still want to enforce a style guide so that humans can continue to understand each other’s code, but this is not a part of the language syntax. It’s simply notation to make reading code easier.

Embrace the Future

By moving the parser to the editor we would not only make the compiler’s job easier, but we would also open up new avenues for pushing programming further. Why have only one whitespace rule? Different editors could have different standards. One programmer might choose newlines and another might choose semicolons. It becomes programmer preference rather than something baked into the language and its compiler.
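As a toy illustration of notation becoming a preference, the sketch below renders one stored program in two styles. The `Assign` and `Style` types and the `render` function are hypothetical, invented for this example, and values are kept as plain strings to keep it short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assign:
    """One statement in the stored program (hypothetical node type)."""
    target: str
    value: str

@dataclass(frozen=True)
class Style:
    """A single programmer's display preferences (hypothetical)."""
    terminator: str  # text appended to each statement
    separator: str   # text placed between statements

def render(statements, style):
    """Turn the stored statements into text for one particular reader."""
    return style.separator.join(
        f"{s.target} = {s.value}{style.terminator}" for s in statements
    )

program = [Assign("width", "640"), Assign("height", "480")]

semicolons = Style(terminator=";", separator=" ")
newlines = Style(terminator="", separator="\n")

print(render(program, semicolons))  # width = 640; height = 480;
print(render(program, newlines))
```

Both programmers are looking at the same `program`. Only the rendering differs, and the compiler never sees either version.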

Maybe there’s a better notation for binary objects. Maybe adding units to number formats would make sense. Maybe we should store all image resources in the source code itself. Or maybe not.

The point is we would be free to explore new ideas without being restricted by the entire code parsing and generation backend. We have to escape the local maximum of plain text source code.

Posted June 13th, 2016

Josh On Design