Typographic Programming Wrapup

I need to move on to other projects so I’m wrapping up the rest of my ideas in this blog. Gotta get it outta my brainz first.

The key concept I’ve explored in this series is that the code you see in an editor need not be identical to what is stored on disk, or the same as what is sent to the compiler. If we relax this constraint then a world of opportunity opens up. We’ve been writing glorified text files for 40 years. We can do better. Let’s explore.

Keywords

Why can’t you name a variable for? Because in many common languages for is a reserved word. You, as the programmer, aren’t allowed to use for because it represent a particular loop construct. The underlying compiler doesn’t actually care of course. It doesn’t care about the name of any of your variables or other words in your code. The compiler just needs them to be unique symbols, some of which are mapped to existing constructs like conditionals and loops.

If the compiler doesn’t care then why can’t we do it? Because the parser (the ‘front end’ of the compiler) does care. The parser needs to unambiguously transform a stream of ASCII text into an abstract syntax tree. It’s the unambiguous part that’s the trouble. The syntax restrictions in most common languages are there to make the parser happy. If the parser was magical and could just "know what we meant" then any syntax could be used. Perhaps even syntax that made more sense to the human rather than the computer.

Fundamentally, this is what typographic programming does. It lets us tell the parser which text is what without using specific syntax rules. Instead we use color or font choices to indicate whether a given chunk of text is a variable or keyword or something else. Of course editing in such a system would be a pain, but we already know how to solve that problem. Graphical word processors are proof that it is possible. Before we get to how we solve it let us consider why. Would such a system have enough benefits to outweigh the cost of building it? What new things could we do?

Nothing’s reserved

If we use typography to indicate syntax, then keywords no longer need to be reserved. Any keyword could be used as a variable and any text string could be used as a keyword. You could swap for with fore or thusly. You could use spaces in keywords as for each of. These aren’t very useful examples, of course, but the compiler could easily handle them.

With the syntactic restrictions lifted we are free to explore new control flow constructs. How about forever to mean an infinite loop and 10 times for standard for fixed length loops? It’s all the same to the compiler but the human reading it would better understand the meaning.

Custom Operators

If nothing is reserved then user defined operators become easy. After all; what is an operator but a function with a single letter name from a restricted character set. In Python 4 + 5 is just sugar for add(4,5).

With no syntax rules anything could be an operator. Operators could have multiple letter names, or other symbols from the full unicode set. The only reason operators are given special treatment to begin with is because they represent functions which are so commonly used (like arithmetic) that we want a shorthand. With free syntax we can create a shorthand for the functions that are useful to the task at hand rather than the abstract general purpose tasks the language inventors imagined.

Let’s look at something more concrete. Using complex numbers and vectors is common in graphics programming, but we have to use clumsy and verbose syntax in most languages. This is sad. Mathematics already has invented compact notation for these concepts but we can’t use them due to ASCII limitations. Without these limitations we could add complex numbers with the plus sign like this:

A + B

instead of

complexAdd(A,B)

To help the programmer remember these are complex numbers they could be rendered in a different color.

There are two ways to multiply vectors: the dot product and the cross product. They have very different meanings. With full unicode we could use the correct symbols like this:

A ⨯ B

A ��� B // cross product

No ambiguity at all. It would be too much to expect a language to support every possible notation. Much better instead to have a language that lets the programmer create their own notation.

Customization in Practice

So how would this work in practice? At some point the code must be transformed into something the compiler understands. Let’s postulate a hypothetical language called X. X has no syntax rules, only the semantic rules of it’s AST. To tell the complier how to convert the code into the AST we must provide our own rules. Something like this.

fun => function
cross => cross
dot => dot
|x| => magnitude(x)

fun intersection(V,R) {
     return V dot R / |V|;
}

We have now defined a mini language in X which still compiles to the same syntactic structure.

Of course typing all of these rules in every file (or compilation unit) would be a huge pain, so we could include them much as we include external libraries.

@include math.rules

fun intersection(V,R) {
     return V dot R / |V|;
}

Most importantly, not only would the compiler understand these rules but so would the editor. The editor can now indicate that V ⋅ R is valid only if they are both vectors. It could enforce the rules from the rule file. Now our code is limited only by the imagination of our rule writers, not the fixed compiler.

In practice, were X to become popular, we would not see everyone making up their own rules. Instead usage would gather around a few popular rulesets much as JavaScript gathered around a few popular libraries like JQuery. We might call each of these rulesets dialects, each a particular flavor derived from the base X language. Custom DSLs would become trivial to implement. It would be common for developers to use one or two "standard" dialects for most of their code but use a special purpose dialect for a particular task.

The important thing here is that the language no longer has a fixed syntax. It can adapt and evolve as needed. All without changing the compiler.

How do you edit?

I hope I’ve convinced you that a flexible syntax delimited by typography is useful. Many common idioms like iteration, accumulation, and data structure traversals could be distilled to concise syntax. And if it has problems then we can tweak it.

There is one big problem though. How would you actually edit code like this?

Fortunately this problem has already been solved by graphical word processors. These tools use color, font, size, weight and spacing to distinguish one element from another. Choosing the mode for a variable is as simple as selecting it with the cursor and using a drop down.

Manually highlighting a entire page of code would quickly grow tedious, of course. For common operations, like declaring a variable, the programmer could type a special symbol like @. This tells the editor that the next letters are a variable name. The programmer ends it with @ or by pressing the spacebar or return key. This @ symbol doesn’t exist in the code. It is simply to indicate to the editor that the programmer wants to be in variable mode. Once the mode is finished the @’s go away and the text is rendered with the variable font. This is no different than using *word* to indicate bold in Markdown text. The stars never appear in the rendered text.

The choice of the @ symbol doesn't matter as long as it's easy with the user's native keyboard. @ is good for US keyboards. French or Russians might use something else.

Resolving Ambiguity

Even using manual markup might become tedious, though. Fortunately the editor can usually figure out the meaning of any given token by using the dialect rules. If the rules indicate that # equals division then the editor can just do the right thing. Using manual highlighting would only be necessary if the _dialect itself_ introduces an ambiguity. (ex: # means division and also the start of a hex value).

What about multiplying vectors? You could type in either of the two proper symbols, but the average keyboard doesn’t support those directly. You’d have to memorize a unicode code point or use a floating dialog. Alternatively, we could use code completion. If you type * then the editor knows this must be either dot or cross product. It provides only those two choices in a drop down, much as we auto-complete method names today.

Using a syntax free language does not fully remove the need to resolve ambiguity, it just moves the resolution process to edit time rather than compile time. This is good. The human is present at edit time and can explain to the computer was is correct. The human is not there at compile time, so any ambiguity must result in an error that the human must come back and fix. Furthermore, resolving the ambiguity need only happen once, when the human types it, not every time the code is compiled. This will further reduce code regressions when other parts of the system change.

Undoubtedly we would discover more edge cases, but these are all solvable. Modern GUI word processors and spreadsheets prove this. A more challenging issue is version control.

Versioning

Code changes over time. It must be versioned. I don’t know why it took 40 years for us to invent distributed version control systems like Git, but at least we have it now. It would be a shame to give that up just as we’ve gotten the world on board. The problem is Git and other VCSs don’t really understand code. They just understand text. There are really only two ways to solve this:

1) modify git, and the other tools around it (diff viewers, github’s website, etc.) to support binary diffs specific to our new system.

2) make the on disk format be pure text.

Clearly option 1 is a non-starter. One day, once language X takes over the world, we could ask the GitHub team to add support for X diffs, but that’s a long ways off. We have to start with option 2.

You might think I’m going back on what I said at the start. After all, I stated we should no longer be writing code as text on disk, but that is exactly what I am suggesting. What I don’t want is to store _the same thing_ that we edit. From the VCS’s point of view the editor and visual representation are irrelevant. The only thing that matters is what is the file on disk. X needs a canonical on serialization format. Regardless of what tool you use to edit X, as long as it saves to the same format we are fine. This is no different than SQL or HTML. Everyone has their favorite tool, but they all write to the same format.

Canonical Serialization Format.

X’s serialization format should obviously be plain text. < 128bit ASCII would be fine, though I think we could handle UTF8 easily. Most modern diff tools can work with UTF8 cleanly, so Japanese comments and math symbols would come through just fine.

The X format should also be unambiguous. Variables are marked up explicitly as variables. Operators as operators. There should be no need for the parser to guess at anything or interpret syntax rules. We could use one of the many existing formats like JSON, XML, or even LaTex. It doesn’t really matter since humans will rarely need to look at them.

But.... since we are defining a new serialization format anyways, there are a few useful things we could add.

Code as Graph

Code is really just a graph. Graphs can be serialized in many ways. Rather than using function names inline they could be represented by identifiers which point to a lookup table. Then, if a function is renamed the code only changes in one place rather than at every point in the code where the function is used. This creates a semantic diff that the diff tool could render as ‘function Y renamed to Z’.

v467 = foo
v468 = bar
v469 = baz

fun v467 () {
   return v468 + v469;
}

Semantic diff-ing could be very powerful. Any refactoring should be reducible to its essential meaning: moved X to a new class or extracted Y from Z. Whitespace changes would be ignored (or never stored in the first place). Commit messages could be context aware: changed X in the unit test for Y and added logging to Z. Our current tools just barely understand when a file has been renamed instead of deleted and a new one added. There’s a lot of room for innovation here.

WrapUp

I hope I’ve convinced you there is value in this approach. Building language X still won’t be easy. To be viable we have to make a compiler, useful dialect definitions, and a visual editor; all at the same time. That’s a lot of work before anyone else can use it. Building on top of existing tools like Eclipse or Atom.io would help, but I know it’s still a big hill to climb. Trust me. The view will be worth it.

Talk to me about it on Twitter

Posted October 6th, 2014

Tagged: fonts programming

Josh On Design