It’s a beautiful thing when free data meets free analytics

All the free data-analysis tools in the world aren’t much use if there aren’t also some free datasets available to analyze. That’s why it’s cool to see BigML, the machine learning service I’ve been writing about for the past year, decide to collaborate with open-data provider Quandl. Even if neither service reaches mass-market popularity, I like seeing stakeholders from different camps work together to lay the groundwork for a data democracy.

I won’t waste your time recapping BigML — I’ve done it in detail before — but will note that the service does have some new features since the last time I played around with it. Among them is a new sunburst visualization to complement the classic tree one.

Quandl, however, is new to me, and it’s pretty cool. It’s a free service offering up more than 6 million financial, economic and social datasets that are neatly formatted and ready for consumption. Even better, most (maybe all) of the datasets are organized by time, and Quandl automatically brings up an embeddable, interactive line chart when you click the link to open a dataset.

Even better than that is the service’s “Supersets” feature, which lets you combine columns from multiple datasets (in one click, mind you) into one big dataset composed of a bunch of disparate variables. Someone interested in analyzing the unemployment rate in Nevada, for example, could create a Superset that compares it against other factors such as currency exchange rates, the U.S. Misery Index (the national unemployment rate plus the inflation rate) and residential energy consumption in California. I did just that, and the result looks like this (absent the energy consumption variable):

[Image: Quandl Superset chart]

The table looks like this:

[Image: Quandl Superset table]
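For a feel of what a Superset amounts to, here is a minimal Python sketch, assuming each dataset is just a mapping from date to value. It lines up several series on their dates and derives the Misery Index as unemployment plus inflation. The function names and all of the numbers are made up for illustration; this is not Quandl's API or its data.

```python
from typing import Dict, List

# Hypothetical monthly values, invented for illustration (not real Quandl data).
nevada_unemployment = {"2013-01": 9.7, "2013-02": 9.6, "2013-03": 9.7}
us_unemployment = {"2013-01": 7.9, "2013-02": 7.7, "2013-03": 7.5}
us_inflation = {"2013-01": 1.6, "2013-02": 2.0}  # March deliberately missing


def misery_index(unemployment: Dict[str, float],
                 inflation: Dict[str, float]) -> Dict[str, float]:
    """Unemployment rate plus inflation rate, for dates present in both series."""
    return {
        date: round(unemployment[date] + inflation[date], 2)
        for date in unemployment
        if date in inflation
    }


def build_superset(columns: Dict[str, Dict[str, float]]) -> List[dict]:
    """Join named series on date; a series with no value for a date gets None."""
    all_dates = sorted(set().union(*(series.keys() for series in columns.values())))
    rows = []
    for date in all_dates:
        row = {"date": date}
        for name, series in columns.items():
            row[name] = series.get(date)
        rows.append(row)
    return rows


misery = misery_index(us_unemployment, us_inflation)
superset = build_superset({
    "nevada_unemployment": nevada_unemployment,
    "misery_index": misery,
})

for row in superset:
    print(row)
```

Keeping every date and filling gaps with `None` is one design choice among several; dropping incomplete rows would be just as defensible, depending on what the analysis tool downstream expects.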

These could, of course, be completely unrelated variables — thus making any correlations all but meaningless — but my assumption is that Nevada depends heavily on tourism, so things going on nationally, internationally and in neighboring states could affect it. I might have chosen other variables, but one of the drawbacks of Quandl right now is that even though it has 6 million datasets, they’re not all super useful. Hopefully, that will change over time.

That aside, though, I think it’s pretty easy to see the value of a service like this.

Once you download the dataset as a CSV file and upload it to BigML, you can start making predictions. Here’s how all this data looks as a sunburst:

[Image: BigML sunburst visualization]

Here’s the prediction interface, which lets you adjust each variable using a slider:

[Image: BigML prediction interface]

This would work better with a larger dataset from which to derive the predictions (and probably if I had better data skills), but the bigger picture is this: I was able to do this in about an hour, and after a long day of working and parenting. As more datasets become publicly available, and as consumers begin getting deluged with their own data from activity trackers, health apps, Google, and any number of data-ownership or data-liberation efforts, they’ll probably want some way to start making sense of it all. None of the myriad services available for doing so is perfect, but we’re on the right track.

To prove that point, here are Nevada’s unemployment rate and the Misery Index charted against each other using DataHero. It has improved quite a bit since entering public beta in April, including getting the export feature to work. This took about two minutes.

[Image: DataHero chart of Nevada unemployment and the Misery Index]
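Charting two series against each other invites the question of how closely they actually move together. As a rough sketch of one way to put a number on it, here is a plain-Python Pearson correlation. The `pearson` helper and the sample values are hypothetical, and this is not how BigML or DataHero compute anything as far as I know. The earlier caveat still applies: a high correlation between unrelated variables, especially ones that simply trend the same way over time, doesn't mean much.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys):
        raise ValueError("series must be the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for illustration (not real data).
nevada_unemployment = [9.7, 9.6, 9.7, 9.6, 9.5]
misery_index = [9.5, 9.7, 9.0, 8.6, 9.0]

print(round(pearson(nevada_unemployment, misery_index), 3))
```

The result runs from -1 (the series move in opposite directions) through 0 (no linear relationship) to 1 (they move in lockstep).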