posted Mon Jun 09 2014
The Web is amazing for answering questions. Suppose you want to answer a question like, "what does the .JPG file extension mean", then the answer is just an internet search away. Millions of answers. However, if you stray from the common path just a tiny bit things get hairy. What if you want to get a list of all file extensions? This is harder to find. Occasionally you might find a PDF listing them, but if you are asking for all file extensions then you probably want to do something with that list. This means you want the list in some computable form. A database or at least a JSON file. Now you are in the world of ‘public’ data. You are in a world of pain.
Searching for “list of file extensions” will take you to the Wikipedia page, which is open but not computable friendly. Every other link you find will be spam. An endless parade of sites which each claim they are the central repository of file extension data. They all have two things in common:
- They are filled with horrible spam like ‘scan your computer to speed it up’ and ‘best stock images on the web’ and ‘get your free credit report now’.
- They let you add new extensions but don’t let you download a complete list of the existing ones.
What I want is basic facts about the world; facts which are generated by the public and really should belong to the public. And I want these facts in a computable form. So far I cannot find such a source for file extensions. These public facts, as they exist on the internet, have morphed into a spam trap: vending tiny bits of knowledge in exchange for eyeball traffic. These sites take a public resource and capture all value from it, providing nothing in return but more virus scanner downloads. That they also provide so little useful information is the reason I have not linked to them (though they are obviously a search away if you care).
The closest I can find to a computable file extension list is the mime type database inside of Linux distros. This brings up a second point. Every operating system, and presumably web browser, needs a list of all file extensions, or at least a reasonable subset. Yet each vendor maintains their own list. Again, these are public facts that should be shared, much as the code which processes them is shared.
File Extensions are not the only public facts which suffer the fate of spam capture. I think this hints at a larger problem. If humanity is to enable global computing, then we need a global knowledge base to work from. A knowledge base that belongs to everyone, not just a few small companies, and especially not the spammers.
Wikipedia and it’s various data offshoots would seem to be the logical source of global computable data, yet the results are dismal. After a decade of asking, Wikipedia’s articles still aren’t computable in any real sense. You can do basic full text search and download articles in (often malformed) wikimarkup. That’s it. Want to get the list of all elements in the periodic table? Not in computable form from Wikipedia. Want to get a list of all mammals? Not from them. Each of these datasets can actually be found on the web, unlike the list of file extensions, but not in a central place and not in the same format. The data offshoots of Wikipedia have even bigger problems, which I will address in a followup blog.
So how do we fix this? Honestly, I don’t know. Many of these datasets do require work to build and maintain and those maintainers need to recoup their costs (though many of them are already paid for with public funds). If this was source code I’d just say it should be a project on GitHub. I think that's what we need.
We need a GitHub for data. A place we can all share and fork common data resources, beholden to no one and computable by everyone.
Building and populating a GitHub for data, at least for these smaller and well defined data sets, doesn't seem like a huge technical problem. Why doesn’t it exist yet? What am I missing?
posted Tue Apr 29 2014
During SXSW this year I had the great fortune to see the keynote given by Stephen Wolfram. If you’ve not heard of him before, he’s the guy who created Mathematica, and more recently Wolfram Alpha, an online cloud brain. He’s an insanely smart guy with the huge ambition to change how we think.
When Stephen started, back in the early 1980s, he was interested in physics but wasn’t very good at integral calculus. Being an awesome nerd he wrote a program to do the integration for him. This eventually became Mathematica. He has felt for decades that with better tools we can think better, and think better thoughts. He didn’t write Mathematica because he loves math. He wrote it to get beyond math. To let the human specify the goals and have the computer figure out how to do it.
After a decade of building and selling Mathematica he spent the next decade doing science again. Among other things this resulted in his massive tome: A New Kind Of Science, and the creation of Wolfram Alpha, a program that systematizes knowledge to let you ask ask questions about anything.
In 1983 he invented/discovered a one dimensional cellular autonomy called Rule 30, (which he still has the code printed on his business cards). Rule 30 creates lots of complexity from a very simple equation. Even if one runs just a tiny program it can end up making interesting complexity from very little. He feels there is no distinction between emergent complexity and brain like intelligence. IE: we don't need a brain like AI, the typical Strong AI claim. Rather, with emergent complexity we can augment human cognition to answer ever more difficult questions.
The end result of all of this is the Wolfram Language, which they are just started to release now in SDK form. By combining this language with the tools in Mathematica and the power of a data collecting cloud; they have created something qualitatively different. Essentially a super-brain in the cloud.
The Wolfram Language is a 'knowledge based language’ as he calls it. Most programming languages stay close to the operation of the machine. Most features are pushed into libraries or other programs. The Wolfram Language takes the opposite approach. It has as much as possible built in; that is the language itself does as much as possible. It automates as much as possible for the programmer.
After explaining the philosophy Stephen did a few demos. He was using the Wolfram tool, which is a desktop app that constantly communicates with the cloud servers. In a few keystrokes he created 60k random numbers, then applied a much of statistical tests like mean, numerical value, and skewness. Essentially Mathematica. Then he drew his live Facebook friend network as a nicely laid out node graph. Next he captured a live camera image from his laptop, partitioned it blocks of size 50, applies some filters, compressed the result to a single final image and tweeted the result. He did all of this through the interactive tool with just a few commands. It really is a union of textual, visual, and the network.
For his next trick, Mr. Wolfram asked the cloud for a time series of air temperatures from Austin for the past year then drew it as a graph. Again, he used only a few commands and all data was pulled from the Wolfram Cloud brain. Next he asked for the countries which border Ukraine, calculated the lengths of the borders, and made a chart. Next he asked the system for a list of all Former Soviet Republics, grabbed the flag image for each, then used a ‘nearest’ function to see which flag is closest to the French flag. This ‘nearest’ function is interesting because it isn’t a single function. Rather the computer will automatically select the best algorithm from an exhaustive collection. It seems almost magical. He did a similar demo using images of hand written numbers and the ‘classify’ function to create a machine learning classifier for new hand drawn numbers.
He’s right. The Wolfram Language really does have everything built in. The cloud has factual data for almost everything. The contents of wikipedia, many other public databases, and Wolfram’s own scientific databases are built in. The natural language parser makes it easier to work with. It knows that NYC probably means New York City, and can ask the human for clarification if needed. His overall goal is maximum automation. You define what you want the language to do and then it’s up the language to figure out how to do it. It’s taken 25 years to make this language possible, and easy to learn and guess. He claims they’ve invented new algorithms that are only possible because of this system.
Since all of the Wolfram Language is backed by the cloud they can do some interesting things. You can write a function and then publish it to their cloud service. The function becomes a JSON or XML web service, instantly live, with a unique URL. All data conversion and hosting is transparently handled for you. All symbolic computation is backed by their cloud. You can also publish a function as a web form. Function parameters become form input elements. As an example he created a simple function which takes the names of two cities and returns a map containing them. Published as a form shows the user two text fields to ask for the city names. Type in two cities and press enter, an image of a map is returned. These aren’t just plain text fields, though. They contain are backed by the full natural language understanding of the cloud. You get auto-completion and validation automatically. And it works perfectly on mobile devices.
Everything I saw was sort of mind blowing if we consider what this system will do after a few more iterations. The challenge, at least in my mind, is how to sell it. It’s tricky to sell a general purpose super-brain. Telling people "It can do anything" doesn't usually drive sales. They seem to be aware of this, however, as they now have a bunch of products specific to different industry verticals like physical sciences and healthcare. They don’t sell the super-brain itself, but specific tools backed by the brain. They also announced an SDK that will let developers write web and mobile apps that use the NLP parser and cloud brain as services. They want it to be as easy to put into an app as a Google Maps. What will developers make with the SDK? They don’t know yet, but it sure will be exciting.
The upshot of all this? The future looks bright. It’s also inspired me to write a new version of my Amino Shell with improved features. Stay tuned.
posted Tue Mar 18 2014
One of the benefits of my job at Nokia is the ability to do indepth research on new technologies. If you follow me on G+ then you know I've been playing with 3D printers for the past few months. As part of my research I prepared a detailed overview of the 3D printing industry that goes into the technologies, the companies involved, and some speculation about what the future holds; as well as a nice glossary of terms. Nokia has kindly let me share my report with the world. Enjoy!
posted Fri Feb 07 2014
I'm working on a few submissions for OSCON, due in two days. I've got lots of ideas, but I don't know which ones to submit. Take a look at these and tweet me with your favorite. If you can't make it to OSCON I'll post the presentations and notes here for all to read.
Augmenting Human Cognition
In a hundred years we will have new bigger problems. We will need new, more productive brains to solve these problems. We need to raise the collective world IQ by at least 40 points. This can only be done by improving the human computer interface, as well as improving physical health and concentration. This session will examine what factors affect the quality and speed of human cognition and productivity; then suggest tools, both historic and futuristic, to improve our brains. The tools are jointly health related: disease, sleep, nutrition, and light; digital: creative tools, AI agents, high speed communication; and also physical augmentation: Google Glass, smart drugs, and additional cybernetic senses.
Awesome 3D Printing Tricks
The dream of 3D printing is for the user to download a design, hit print, and 30 minutes later a model pops out. What’s the fun in that? 3D printers are great because each print can be different. Let’s hack them. This session will show a few ‘unorthodox’ 3D printing techniques including mixing colors, doping with magnets and wires, freakish hybrid designs, and mason jars. Lots of mason jars.
The techniques will be demonstrated using the open source Printrbot Simple, though they are applicable to any filament based printer. No coding skills or previous experience with 3D printers is required, though some familiarity with the topic will help.
Cheap Data Dashboards with Node, Amino and the Raspberry PI
Thanks to the Raspberry Pi and cheap HDMI TV sets, you can build a nice data dashboard for your office or workplace for just a few hundred dollars. Though cheap, the Raspberry PI has a surprisingly powerful GPU. The key is Amino, an open source NodeJS library for hardware accelerated graphics. This session will show you how to build simple graphics with Amino, then build a few realtime data dashboards of Twitter feeds, continuous build servers, and RSS feeds; complete with gratuitous particle effects.
HTML Canvas Deep Dive 2014
Behind plain images, HTML Canvas is the number one technology for building graphics in web content. In previous years we have focused on 3D or games. This year we will tackle a more useful topic: data visualization. Raw data is almost useless. Data only becomes meaningful when visualized in ways that humans can understand. In this three hour workshop we will cover everything needed to draw and animate data in interesting ways. The workshop will be divided into sections cover both the basics and techniques specific to finding, parsing, and visualizing public data sets.
The first half of the workshop will cover the basics of HTML canvas, where it fits in with other graphics technologies, and how to draw basic graphics on screen. The second half will cover how to find, parse, and visualize a variety of public data sets. If time permits we will examine a few open source libraries designed specifically for data visualization. All topics we don’t have time to cover will be available in a free ebook to read.
Bluetooth Low Energy: State of the Union
In 2013 Bluetooth Low Energy, BLE, was difficult to work with. APIs were rare and buggy. Hackable shields were hard to find. Smartphones didn’t support it if they weren’t made by Apple, and even then it was limited. What a difference a year makes. Now in 2014 you can easily add BLE support to any Arduino, Raspberry PI, or other embedded system. Every major smartphone OS supports BLE and the APIs are finally stable. There are even special versions of Arduino built entirely around wiring sensors together with BLE. This session will introduce Bluetooth Low Energy, explain where it fits in the spectrum of wireless technologies, then dive into the many options today’s hackers have to add BLE to their own projects. Finally, we will assemble a simple smart watch on stage with open source components.
posted Tue Jan 28 2014