In October 2010, WikiLeaks released 391,832 reports from the Iraq War, the most comprehensive set of documents about the conflict to date. Each describes a specific incident in the war. At the rate of one document per minute, it would take 272 days non-stop to read every report, and you still might miss the big picture. This is exactly the sort of problem where visualization can help, by turning patterns in the documents into patterns in a picture.
The Guardian and others had already created visualizations by plotting the incident locations on a map of Iraq, and by graphing monthly casualties. My Associated Press colleague Julian Burgess and I wanted to go a step further, by designing a visualization based on the richest part of the report: the summary text, a human-readable description of what actually happened. But how? We pulled out our data mining textbooks and started experimenting, eventually settling on a technique that extracts the key words from each document. A document's key words are the words that appear frequently in that document but rarely in the others.
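The technique is not named here, but one standard weighting that fits the description is TF-IDF (term frequency times inverse document frequency). What follows is a minimal sketch under that assumption, not the code behind the picture; the function names `tokenize` and `key_words` and the three toy reports are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def key_words(docs, top_n=3):
    """Return the top_n highest-scoring words for each document.

    Score = (share of the document's words that are this word)
            * log(number of documents / number of documents containing it).
    This is plain TF-IDF, assumed here as a stand-in for the real method.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Count how many documents each word appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_n])
    return results


# Toy stand-ins for report summaries (hypothetical text).
reports = [
    "corpse found shot corpse",
    "mortar attack mortar injured",
    "tanker truck found tanker truck",
]
top = key_words(reports, top_n=2)
print(top)
```

A word such as "found", which turns up in several reports, scores low even when it is repeated, while a word that is concentrated in a single report rises to the top of that report's list.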
This is a picture of the 11,616 SIGACT ("significant action") reports from December 2006. Each report is a dot, labeled by its key words. Reports with similar key words have edges drawn between them. The location of the dot has nothing to do with geography. Instead, we ran an algorithm that pulls dots with edges between them closer together. Then we labeled each cluster by the key words that are common to the reports in that cluster, and colored each report/dot by the "incident type," as entered by military personnel. The result is an abstract map of the bloodiest month of the war.
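The two steps described above, drawing edges between reports with similar key words and then pulling connected dots together, can be sketched as follows. This is a toy illustration under stated assumptions rather than the algorithm used for the picture: similarity is measured here by Jaccard overlap of key-word sets, the layout is a bare-bones spring model, and the names `jaccard`, `build_edges`, `layout`, and all parameter values are hypothetical.

```python
import math


def jaccard(a, b):
    """Overlap between two key-word sets: shared words / all words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_edges(labels, threshold=0.5):
    """Connect every pair of reports whose key words overlap enough."""
    n = len(labels)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if jaccard(labels[i], labels[j]) >= threshold]


def layout(n, edges, steps=200, pull=0.1, push=0.01, max_move=0.1):
    """Toy force-directed layout: edges pull dots together, all dots repel."""
    # Start evenly spaced on a unit circle so the sketch is deterministic.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(n)]
        # Every pair of dots pushes apart, more strongly when close.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                scale = push / (dx * dx + dy * dy + 1e-9)
                force[i][0] += dx * scale
                force[i][1] += dy * scale
                force[j][0] -= dx * scale
                force[j][1] -= dy * scale
        # Each edge acts like a spring pulling its two dots together.
        for i, j in edges:
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            force[i][0] += pull * dx
            force[i][1] += pull * dy
            force[j][0] -= pull * dx
            force[j][1] -= pull * dy
        # Move each dot along its net force, capped at max_move per step.
        for p, (fx, fy) in zip(pos, force):
            size = math.hypot(fx, fy)
            shrink = min(1.0, max_move / size) if size > 0 else 0.0
            p[0] += fx * shrink
            p[1] += fy * shrink
    return pos


# Hypothetical key-word labels for four reports.
labels = [
    ["corpse", "shot"],
    ["corpse", "shot", "head"],
    ["tanker", "truck"],
    ["tanker", "truck", "fuel"],
]
edges = build_edges(labels)
dots = layout(len(labels), edges)
```

In this toy run the two "corpse" reports end up next to each other and far from the two "tanker" reports, which is the effect that makes clusters visible. The all-pairs loops are fine for a handful of dots; a real month of more than 11,000 reports would need a faster method.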
The central cluster is blue, the color for the "criminal event" type, and the documents within it all include the word "corpse." There are a heartbreaking number of them, because this was the height of the Iraqi civil war. Sub-clusters include various modifiers such as "shot."
Above this, the blue "criminal event" cluster merges into the green "enemy action" reports. At the interface we have "civ, killed, shot," which are apparently reports of civilians wounded in battle. Enemy actions also have their own clusters labeled with "mortar," "female," "officer," and "injured." We haven't looked into the "female"/"enemy action" cluster yet; perhaps there is a previously untold story there.
There is a red cluster off to the side. Red signifies that the military coded these reports as "explosive hazard," and the documents here all include the words "tanker truck." Sure enough, there are contemporaneous press reports of tankers being used as explosive weapons, and this cluster shows that there were at least several dozen such incidents throughout Iraq in December 2006, though it doesn't immediately distinguish between explosions and attempted or threatened explosions.
There's another cluster of blue "criminal event" reports, labeled "blindfolded, feet, hands." Bound feet and hands were common in sectarian violence at the time, and some reports include the word "torture." There's a nearby cluster of abductions.
It goes on. December 2006 was a disturbing and complicated time in Iraq, and the visualization has patterns at all scales, especially if you look at the hi-res image and read the tiny single-report labels. There are some dark green "friendly action" reports labeled "convoy," and other "friendly actions" which mention the troublesome town of Haditha (near bottom left). And there is the oil connection, a group of reports which include the word "pipeline."
There is a lot to learn from this image, and much more we could do. To begin with, we'd like to try coloring each dot according to the number of casualties, data which is already in the SIGACTs. We know that more than 4,000 U.S. troops and 100,000 civilians died in Iraq, but what were the circumstances of their deaths? Perhaps we can start to answer that question. We also want to find a way to animate this diagram through time, so we can see how the war changed as it progressed. And of course, we want to get these visualizations into the hands of reporters, U.S. forces, and others who were actually there, so they can help us understand what all of this means.
For more about this visualization, including the technical details, see this article. We see so much potential that The Associated Press, in collaboration with visualization experts and other news organizations, is developing an open-source system for visual exploration of large document sets of many different types. Whether they're leaked, obtained under freedom of information laws, or released as part of government transparency initiatives, large document sets have become a significant part of journalism.
Jonathan Stray is the Interactive Technology Editor of the Associated Press.
Comments in chronological order (Total 2 comments)
16 December 2010 11:28AM
Great graphic display Jonathan. Two questions.
1) What is happening in the top left of the graphic? It looks a little ordered around one point.
2) In assessing the frequency of, for example, a word such as dangerous what measures were taken to check whether it was preceded by the word 'not'?
16 December 2010 8:34PM
So which technique was it? The above description would fit most text mining.
The AP open-source visualisation system is very good news.