Sacred Language Cows Part 2: we can rebuild it. We have the technology.

I recently wrote an amusing rant on programming languages called "An open letter to language designers: Kill your sacred cows." It was, um, not well received. If you read some of the comments on Reddit and Hacker News you see that most people think I'm an idiot, a novice, or know nothing about programming languages. *sigh*. Maybe I am.

Most people got hung up on my first point: don't store source code as plain text on disk. I thought it was innovative but relatively uncontroversial. Surely, one day we won't store source as ASCII text files, so let's start thinking about it now. I mean, surely the Starship Enterprise's OS wasn't written in plain text files with VIM. Surely! I guess I underestimated the Internet.

Perhaps Not

So why did I not spark any real discussion? First I must blame myself for writing in the form of a rant. I thought the obnoxious insults would be amusing jokes, very tongue in cheek. Alas, tone is very hard to convey on the Internet through prose. That is completely my fault. I think I made an even bigger error, though. I talked about some interesting ideas (or rather, what *not* to do) but I didn't give any solutions or go in-depth with my justifications. That is why I have written the essay you are reading now: to explore just the first point and see if it makes sense.

Engineering is about tradeoffs. Cheap, fast, and correct: pick any two. So it is with source code stored as something other than ASCII text. Any modification will require changes in the surrounding ecosystem of tools. Version control. Continuous build systems. Diff tools. IDEs. The question is whether the benefits outweigh the negatives plus the cost of change. In this essay I hope to show you that the benefits could outweigh the costs, or at least come close enough that the idea is worth exploring.

What is a codebase?

A codebase is fundamentally a large graph structure. If you look at the typical Java codebase you have a bunch of classes with varying visibility organized into packages. The classes contain fields and methods which are composed of statements and expressions. A big tree. Since classes can call each other in varying ways, not to mention other libraries on the class path, we get a potentially circular structure. A codebase also has non-code resources like images, as well as some semi-code structures like XML config files (I'll leave out the build infrastructure for now). So we get a big complex graph that grows bigger and more complex over time.

A good IDE will create this codebase graph in memory. This graph is what gives you things like cheap refactoring and code completion. But when you quit your IDE what does it do? It serializes the entire graph as code split across multiple files and directories. When you restart the IDE it must regenerate this entire graph in memory. Though it wastes some CPU, this system would be fine if everything from the in-memory graph were preserved and restored perfectly without any data loss. Sadly it is not. The source code can't preserve developer settings. These are stored elsewhere. It can't store a history of what the developer actually did at what time. That is stored elsewhere or not at all. I think there is something fundamentally wrong with the fact that we are converting our complex graph into a lossy data store.

Source Code Is Lossy?

Yes, I think source code is lossy. A source file must meet two opposing demands. It must be human readable and writable with as little annoyance as possible. It must also be parseable into a graph structure (the Abstract Syntax Tree or AST) by the compiler without any ambiguity. These twin demands are sometimes in conflict, and when they are the compiler usually wins. Let me give you a simple example. Nested block comments.

In your typical C-derived language like Java you can have block comments delimited by /* and */. You can also use single line comments that begin with // and finish at the end of the line. This basic system is a few decades old and has the unfortunate side effect that block comments cannot be nested. The following code is probably clear to a human but will choke the compiler.

some live code /*  first commented out code /* second commented out code */  more first commented out code */ more live code

I know what this means but it is impossible for the compiler to know. Of course, the compiler could start counting open and closing delimiters to figure it out, but then it might choke on a */ stored inside a string. Or the human could nest the single line comments with the help of an IDE. All of these are solvable problems with a more advanced compiler or improved syntax, but they further complicate the language definition and impose more mental cost on the human doing the coding.
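To make the delimiter-counting idea concrete, here is a minimal sketch in Python. It is not how any real compiler works; the function name strip_nested_comments is made up for illustration, and I assume only double-quoted string literals with backslash escapes.

```python
def strip_nested_comments(source: str) -> str:
    """Remove /* ... */ comments, allowing nesting, and leave live string literals alone.

    Hypothetical helper for illustration; assumes double-quoted strings only.
    """
    out = []
    depth = 0          # number of block comments currently open
    in_string = False  # inside a "..." literal (only tracked in live code)
    i = 0
    while i < len(source):
        two = source[i:i + 2]
        if in_string:
            out.append(source[i])
            if source[i] == "\\" and i + 1 < len(source):
                out.append(source[i + 1])  # keep the escaped character verbatim
                i += 2
                continue
            if source[i] == '"':
                in_string = False
            i += 1
        elif two == "/*":
            depth += 1
            i += 2
        elif two == "*/" and depth > 0:
            depth -= 1
            i += 2
        elif depth > 0:
            i += 1  # skip commented-out text
        else:
            if source[i] == '"':
                in_string = True
            out.append(source[i])
            i += 1
    if depth > 0:
        raise ValueError("unterminated block comment")
    return "".join(out)


print(strip_nested_comments("live /* one /* two */ still one */ more live"))
# prints: live  more live
```

Notice that the scanner only tracks strings in live code. A */ inside a string that sits inside commented-out code will still close the comment early, which is exactly the kind of ambiguity that a text format keeps forcing on us.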

The Alternative

Now consider an IDE and compiler that work directly on the graph. I select some code in my editor and click the 'comment' button. It marks that section of the graph as being in a comment. Now I select a larger chunk of code, containing the first chunk, and mark it as a comment. Internally the IDE has no problem with this. It's just another few layers in the tree graph. It can let me move in and out of the code chunks with no ambiguity because it is operating directly on the graph. If the IDE can store this graph to disk as a graph then the compiler can also do its thing with no ambiguity. If we first serialize to human readable source code then parts of the graph are lost and ambiguity results.

This sounds like a trivial example, and quite frankly it is. That's the point. This is a simple case that is hard for the current system but trivial for a graph based system. How much more work must the compiler do to handle harder problems? Here are a few examples of things that become trivial with a graph system.

  • A version control system that loaded up a graph instead of text files could calculate perfect diffs. It would actually know that you renamed class A to class B because the IDE recorded that history. Instead, today it must reverse engineer this information by noticing one file is gone and another one appeared. If you renamed several classes between commits then the diff tool could get confused. With a graph the diffs would be perfect.
  • With perfect diffs the version control system could visualize them in a way that is far more useful. Rather than a bunch of text diffs it could say: "class A was renamed to B. Class D was renamed to G. Class E was deleted. Methods X, Y, and Z were moved up to superclass R. The contents of methods Q and R were modified. Click to see details."
  • Renaming and refactoring are trivial. We already know this because the IDE does it using an internal graph. What if other tools could do that too? You could write a script that would perform complex modifications to entire codebases with ease. (James Gosling once experimented with such a system called JackPot).
  • As with comments, multi-line strings become trivial. Just paste it in. The IDE will properly insert it into the graph. The language definition doesn't have to care and the compiler will work perfectly.
  • Variable interpolation in strings is a function of the IDE rather than the language. The IDE can store "for ${foo} justice" as ("for " + foo + " justice") in the internal graph. This is passed to the compiler so it doesn't have to worry about escaping rules. Any sort of text escaping becomes trivial to deal with.
  • Since the parsing step is gone we no longer need to worry about delimiters like {} and (). We can still use them in the IDE to make the code more comprehensible to humans, but some programmers might prefer indentation to delimiters. The IDE will take care of it, showing them if you wish or hiding them. It makes no difference to the underlying graph or the compiler. Here's a great example of an IDE built by Kirill Osenkov with very creative ways to visualize code blocks.
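As a sketch of the diff idea from the first two bullets, suppose every class in the graph carries a stable id that survives renames. That id is my assumption, as are the Snapshot type and the describe_changes function; the point is only that a rename becomes a lookup instead of a guess.

```python
from typing import Dict, List, Tuple

# A snapshot maps a stable node id to (class name, class body).
# The stable id is an assumption of this sketch, not an existing tool's format.
Snapshot = Dict[int, Tuple[str, str]]


def describe_changes(old: Snapshot, new: Snapshot) -> List[str]:
    """Compare two snapshots by node id and describe what happened."""
    changes: List[str] = []
    for node_id in sorted(old.keys() | new.keys()):
        if node_id not in new:
            changes.append(f"class {old[node_id][0]} was deleted")
        elif node_id not in old:
            changes.append(f"class {new[node_id][0]} was added")
        else:
            old_name, old_body = old[node_id]
            new_name, new_body = new[node_id]
            if old_name != new_name:
                changes.append(f"class {old_name} was renamed to {new_name}")
            if old_body != new_body:
                changes.append(f"the contents of class {new_name} were modified")
    return changes


before = {1: ("A", "int x;"), 2: ("D", "int y;"), 3: ("E", "int z;")}
after = {1: ("B", "int x;"), 2: ("G", "long y;")}

for line in describe_changes(before, after):
    print(line)
# class A was renamed to B
# class D was renamed to G
# the contents of class G were modified
# class E was deleted
```

A text-based tool looking at the same commit would see files A, D, and E vanish and files B and G appear, and would have to guess which is which from content similarity.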
New Things

    Once we have fully committed to storing the graph directly rather than the lossy plain text phase, other things become possible. We can store many forms of metadata directly in the graph rather than splitting it out into separate classes or storing it outside the code repo. Now the IDE can focus on making life better for the developer. Here are a few things I'd like to see.

    • A better interface for creating regexes. Define your regex in a special editor embedded into the main code editor. This editor not only provides docs and syntax highlighting, it can store example data along with the regex. Are you parsing credit card numbers? Type in 20 example strings showing the many ways it could go wrong. The editor shows you how they pass or fail. It also stores it with the source to be executed as unit tests by the continuous build system.
    • Inline docs would be editable with a proper WYSIWYG interface. No more having to remember the particular markup syntax used by this language. (Though you could still switch to raw markdown, html, etc. if you chose). Images in docs would become more common as well. You could also select a snippet of live code and mark it as being an example for the docs above. Then your example code will never be out of date.
    • Inline unit tests. Why must our unit tests be in a separate hierarchy run with separate commands? If it is good to keep docs near the code then it seems logical to keep the tests with the code as well. The IDE could give me a list of tests relevant to the current block of code I'm looking at. I can see the history of these tests and whether they currently fail. I can create a new test for the current code without creating new classes and files. The easier we make it to create tests, the more of them will be created.
    • Edit and store non-code resources inline. Icons, vector artwork, animation interpolators, hex values for colors. This would all be stored as part of the source and edited inline using the appropriate interface. Colors can be edited with a color picker instead of remembering hex codes. Animation can be edited with a bezier curve instead of guessing at values. The icon would actually be shown visually instead of just a file path. All of these things can be done today, but they become easier and more standardized if we work directly on the graph instead of plain text. Here is an editor called Field that shows some of the possibilities.
    • Once we get away from the file being the atomic unit of code, then we can start to arrange our code visually in other ways; ways that might be more useful to the human rather than the compiler. Chris Granger just posted a video for a cool new IDE concept called the Light Table.
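The regex idea from the first bullet can be approximated even today. Here is a minimal sketch in Python where the example strings travel with the pattern. The RegexWithExamples class is hypothetical, and the card pattern is deliberately simplistic (sixteen digits in four groups with a consistent optional separator), nowhere near real card validation.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegexWithExamples:
    """A regex stored together with the examples that document and test it."""
    pattern: str
    should_match: List[str] = field(default_factory=list)
    should_reject: List[str] = field(default_factory=list)

    def check(self) -> Dict[str, bool]:
        """Map each example to True if the regex behaves as that example expects."""
        compiled = re.compile(self.pattern)
        results: Dict[str, bool] = {}
        for text in self.should_match:
            results[text] = compiled.fullmatch(text) is not None
        for text in self.should_reject:
            results[text] = compiled.fullmatch(text) is None
        return results


card_number = RegexWithExamples(
    # Simplified for illustration: the backreference \1 forces one separator style.
    pattern=r"\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}",
    should_match=["4111111111111111", "4111 1111 1111 1111", "4111-1111-1111-1111"],
    should_reject=["4111 1111-1111 1111", "4111 1111 1111", "4111 1111 1111 111a"],
)

for example, ok in card_number.check().items():
    print("pass" if ok else "FAIL", repr(example))
```

An editor could show those pass and fail marks live as you type, and a build system could run check() as an ordinary unit test.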


    I hope I have shown you that there is a wide world of new coding ideas being explored, most of them not related to the language definition but rather to how we human programmers interact with the code. Certainly there are costs associated with such a change, but I think the costs could be dealt with if the benefits are worth it, and I clearly believe they are. We can update version control systems to work with a graph instead of text diffs. We can define a standard for storing the graph on disk so that any IDE can work with it. We can create libraries for manipulating the graph, enabling others to easily create new tools. Those are all solvable problems if we decide we want to.

    Ultimately my disappointment with new programming languages, and the impetus for my original essay, is that we live in the twenty-first century but we still code like it's the 1970s. I want to see our field move forward by leaps and bounds, not incremental improvements to C, Smalltalk, and Lisp. The challenges our industry faces will be easier to deal with if we aren't wedded to our current notions of how coding is done. 100 core parallelization. Code scaling from non-smartphones to global computing networks. 3D interfaces built from scene graphs and real time imagery. Ubiquitous computing and latency driven design. We have a lot coming, and coming soon, but our language researchers are acting like compilers and languages are a solved problem with only a few tweaks needed. And if we, the real world programmers, are afraid to try anything too radical then perhaps the researchers are right.

    But I don't think so. The challenges coming this century are huge. We should invent new tools to solve them.


    Posted April 14th, 2012

    Tagged: rant