What I Do
Data journalism has sprawled into an all-encompassing term for new digital journalism, excluding social media and community work. In other words, it covers news generated from a newsroom that doesn't consist solely of printed text. Yet a wide variety of interwoven, interrelated fields sit under the term data journalism.
So when asked about getting into data journalism, one has to clarify which branch. The first split from the trunk is between the broader areas of investigative data journalism and data visualisation. As newsrooms moved from print to digital, few had the chance or expertise to experiment, so the whole tree fell under the data journalism canopy. Early teams were thus made up of full-stack developers (most were designers and developers with an editorial leader, such as the New York Times team run by Aron Pilhofer).
What I Need To Do It
There is no clear way into data journalism. First you must decide which area of this wide field interests you; find where your passion and curiosity lie. You will never reach a level of training where you feel you can satisfactorily do your job. That is the difference between journalism and data journalism: the web evolves in three-month cycles, so you are always playing catch-up, whether that means retrieving data or building the latest unique interactive.
The most important thing you need is drive, because it is hard. I have learnt an object-oriented programming language primarily for scraping (Python), a database and its query language (MySQL, or Elasticsearch if the data is very large) and a statistical analysis language (R). I want to get at a story, to find an interesting pattern, a tantalising trail. For that, I need to be able to get, clean, sort and analyse data in order to understand it.
But it’s not just the skills that are an important part of the journey. They are just what you need to get ready. It’s what you learn along the way that is your greatest asset.
How I Do It
I work in a virtual world. Literally. The only pieces of software installed on my machine are VirtualBox and Vagrant. I create virtual machines inside my machine, and I keep blueprints for many of them. Each machine has a different function, that is, a different piece of software installed. So to perform a task such as fetching the data, cleaning it or analysing it, I have a brand new environment which can be recreated on any computer.
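The disposable-environment cycle described above can be sketched as a small wrapper around Vagrant's command line. The `vagrant up`, `vagrant ssh -c` and `vagrant destroy -f` subcommands are real; the wrapper function and the task path are illustrative, not taken from my actual setup.

```python
# Sketch of driving a disposable "intern" VM with Vagrant.
# The vagrant subcommands are real; the wrapper itself is illustrative.
import subprocess

def intern_commands(task: str) -> list[list[str]]:
    """Build the command sequence for one disposable environment."""
    return [
        ["vagrant", "up"],               # boot a fresh VM from its blueprint
        ["vagrant", "ssh", "-c", task],  # run one job inside it
        ["vagrant", "destroy", "-f"],    # kill the intern when done
    ]

if __name__ == "__main__":
    for cmd in intern_commands("python /vagrant/scrape.py"):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment on a machine with Vagrant
```

Because the environment is destroyed after every run, nothing leaks from one job to the next, which is what makes each "intern" interchangeable.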
I call these environments “Infinite Interns”. In order to help journalists see the possibilities of what I do, I tell them to think about what they could accomplish if they had an infinite number of interns. Because that’s what code is. Here are a couple of slides about my Infinite Interns system:
To use this system I have another repository called Skel. This is the skeleton layout for all my data-driven projects; every time I start a new project I download Skel. It has folders for the various components of an investigation: the source data, the transformed data, the analysis, and so on. With each intern I bring to life, I transform the data and put the results on my own machine. After that I kill it. A benefit of having virtual interns is that no labour laws apply!
All my processes are coded, as are my environments, so I can write a script that runs the investigation from start to finish in one command. That makes it completely transparent and reproducible.
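A one-command investigation might look like the pipeline runner below. The stage functions are placeholders standing in for the real fetch, clean and analyse scripts; the point is that the stages run in a fixed order from a single entry point, so anyone can rerun the whole thing.

```python
# Sketch of a one-command, reproducible investigation pipeline.
# Stage functions are placeholders for the real fetch/clean/analyse scripts.

def fetch(state):
    state["raw"] = [" 3", "1 ", "2"]  # stand-in for scraped data
    return state

def clean(state):
    state["clean"] = sorted(int(x) for x in state["raw"])
    return state

def analyse(state):
    state["total"] = sum(state["clean"])
    return state

PIPELINE = [fetch, clean, analyse]

def run(state=None):
    """Run every stage in order so the whole investigation is reproducible."""
    state = state or {}
    for stage in PIPELINE:
        state = stage(state)
    return state

if __name__ == "__main__":
    print(run()["total"])  # prints 6
```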
Why I Do It
It has taken me a long time to develop this process. It’s built on what I’ve come to learn as an OpenNews Fellow at The Guardian. I use code to do journalism, to find and report stories. Thus my process of development is not built around agile or responsiveness or all the other ways developer teams are run. My process is built around transparency and reproducibility. It is meant to stand up in court.
I practice this on every scale. For instance, I have a three-step process when I scrape. For a large enough dataset, the site will offer a search function that pulls out sections of the underlying database. I write a scraper to pull out every section, paginate through the entries and store the URLs for each of the individual entries. Then I write another scraper to go to all the URLs and store the HTML for each page. All of it. Only after I have done that do I write a third and final scraper to collect all the necessary details and store them in a structured database.
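The three steps above can be sketched as follows. To keep the example self-contained it runs on inline HTML strings; a real version would fetch each URL over HTTP and write the complete raw pages to disk before any parsing.

```python
# Sketch of the three-step scrape, run on inline HTML for illustration.
# A real version would fetch pages over HTTP and archive the raw HTML
# on disk before any parsing.
from html.parser import HTMLParser

INDEX_PAGE = '<a href="/entry/1">One</a><a href="/entry/2">Two</a>'
PAGES = {  # stand-in for the raw HTML archive built in step two
    "/entry/1": "<h1>One</h1><p>first record</p>",
    "/entry/2": "<h1>Two</h1><p>second record</p>",
}

class LinkCollector(HTMLParser):
    """Step one: pull every entry URL out of the search results."""
    def __init__(self):
        super().__init__()
        self.urls = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href")

def step_one(index_html):
    collector = LinkCollector()
    collector.feed(index_html)
    return collector.urls

def step_two(urls):
    # Store the complete HTML of every page, untouched, as evidence.
    return {url: PAGES[url] for url in urls}

class TitleParser(HTMLParser):
    """Step three: extract structured details from the stored HTML."""
    def __init__(self):
        super().__init__()
        self.in_h1, self.title = False, ""
    def handle_starttag(self, tag, attrs):
        self.in_h1 = tag == "h1"
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1:
            self.title += data

def step_three(archive):
    return [{"url": url, "title": _title(html)} for url, html in archive.items()]

def _title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title

if __name__ == "__main__":
    print(step_three(step_two(step_one(INDEX_PAGE))))
```

Step two deliberately stores the pages whole rather than the extracted fields, which is exactly the evidentiary point made below.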
Having recently taught this process to a group of developers at Al Jazeera English, one of them asked me why I have the second, interim step. Surely it wastes time and computing space? But you need it, just as a journalist needs their notes. When you scrape a website you are fetching information from a site that has fetched it from a server. The entity you are investigating has access to that server and can change that information. It will change on the site, and what proof do you then have that you did not fabricate the data or retrieve it incorrectly? If you have the HTML, you have proof of what the website served at that point in time.
I am a journalist by training and am learning developer skills. But I am creating my own processes based on what I need as a journalist, not as a developer. It is an ever-evolving process, but that insight is the result of two years delving into the world of data journalism. As long as you have an insight of your own, a unique way of getting done what you need to get done, then you are a data journalist.