Wikileaks Iraq: how to visualise the text

Jonathan Stray shows how AP developed a revolutionary approach to text as data

[Image: Wikileaks Iraq: visualising the full text]

In October, WikiLeaks released 391,832 reports from the Iraq War, the most comprehensive set of documents about the conflict to date. Each is a report of a specific incident in the war. At the rate of one document per minute, it would take 272 days non-stop to read every report, and you still might miss the big picture. This is exactly the sort of problem where visualization can help, by turning patterns in the documents into patterns in a picture.

The Guardian and others had already created visualizations by plotting the incident locations on a map of Iraq, and by graphing monthly casualties. My Associated Press colleague Julian Burgess and I wanted to go a step further, by designing a visualization based on the richest part of the report: the summary text, a human-readable description of what actually happened. But how? We pulled out our data mining textbooks and started experimenting, eventually settling on a technique that extracts the key words from each document. A document's key words are the words that appear frequently in that document but rarely in the others.
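
The article does not name the key-word technique. One common method that fits the description is TF-IDF scoring, so the sketch below assumes it. The function names, the tokenizer and the three sample reports are made up for illustration and are not taken from the SIGACT data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document.

    Assumption: TF-IDF stands in for the unnamed technique in the article.
    A term scores high when it is frequent in this document (TF) and
    appears in few of the other documents (IDF).
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:k])
    return keywords


# Hypothetical report summaries, invented for this example.
sample_reports = [
    "corpse found shot in the street",
    "corpse found blindfolded in the river",
    "mortar attack on the convoy",
]
sample_keywords = top_keywords(sample_reports, k=2)
```

Words such as "the", which occur in every sample report, get a score of zero and fall to the bottom of the ranking. That is the "rarely in the others" half of the description.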

[Image: Criminal activities where someone died, broken down]

This is a picture of the 11,616 SIGACT ("significant action") reports from December 2006. Each report is a dot, labeled by its key words. Reports with similar key words have edges drawn between them. The location of the dot has nothing to do with geography. Instead, we ran an algorithm that pulls dots with edges between them closer together. Then we labeled each cluster by the key words that are common to the reports in that cluster, and colored each report/dot by the "incident type," as entered by military personnel. The result is an abstract map of the bloodiest month of the war.
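
The two steps described above, drawing edges between reports with similar key words and then pulling connected dots together, can be sketched roughly as follows. This is an illustration under stated assumptions, not the layout code AP used. Similarity is measured here as Jaccard overlap of key-word sets with an arbitrary threshold. The layout is a bare-bones force-directed loop, and the weak repulsion between all dots is added only so that clusters do not collapse to a single point. All names and parameter values are hypothetical.

```python
import itertools
import math
import random


def similarity_edges(keyword_sets, threshold=0.3):
    """Connect reports i < j whose key-word sets overlap enough.

    Assumption: Jaccard overlap (shared words / all words) is the
    similarity measure; the article does not specify one.
    """
    edges = []
    for i, j in itertools.combinations(range(len(keyword_sets)), 2):
        union = keyword_sets[i] | keyword_sets[j]
        shared = keyword_sets[i] & keyword_sets[j]
        if union and len(shared) / len(union) >= threshold:
            edges.append((i, j))
    return edges


def force_layout(n_nodes, edges, iterations=300, step=0.05,
                 attraction=1.0, repulsion=0.01, seed=0):
    """Place dots in 2D so that dots joined by an edge end up close together."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
           for _ in range(n_nodes)]
    edge_set = {tuple(sorted(edge)) for edge in edges}

    for _ in range(iterations):
        force = [[0.0, 0.0] for _ in range(n_nodes)]
        for i, j in itertools.combinations(range(n_nodes), 2):
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dist = max(math.hypot(dx, dy), 0.01)  # avoid division by zero
            # Every pair repels weakly; pairs joined by an edge also attract.
            pull = -repulsion / dist
            if (i, j) in edge_set:
                pull += attraction * dist
            fx, fy = pull * dx / dist, pull * dy / dist
            force[i][0] += fx
            force[i][1] += fy
            force[j][0] -= fx
            force[j][1] -= fy
        for p, f in zip(pos, force):
            p[0] += step * f[0]
            p[1] += step * f[1]
    return pos


# Hypothetical key-word sets for five reports.
example_sets = [
    {"corpse", "shot"},
    {"corpse", "shot", "found"},
    {"corpse", "found"},
    {"tanker", "truck"},
    {"tanker", "truck", "explosive"},
]
example_edges = similarity_edges(example_sets)
example_positions = force_layout(len(example_sets), example_edges)
```

The all-pairs loop is quadratic in the number of reports, so a month of more than 11,000 reports would need a spatial index or an established graph-layout library. The fixed step size can also become unstable for dots with very many edges.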

The central cluster is blue, the color for the "criminal event" type, and the documents within it all include the word "corpse." There are a heartbreaking number of them, because this was the height of the Iraqi civil war. Sub-clusters include various modifiers such as "shot."
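
A label such as "corpse" is a word shared by every report in a cluster. The sketch below shows one way to compute such labels. It assumes a cluster is a connected component of the edge graph, which the article does not state, and the key-word sets and edges are invented for illustration.

```python
def connected_components(n_nodes, edges):
    """Group node indices into clusters of mutually reachable nodes."""
    neighbors = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    seen = set()
    components = []
    for start in range(n_nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(sorted(component))
    return components


def cluster_labels(keyword_sets, components):
    """Label each cluster with the key words common to all of its reports."""
    return [
        sorted(set.intersection(*(keyword_sets[i] for i in component)))
        for component in components
    ]


# Hypothetical key-word sets and edges for five reports.
label_sets = [
    {"corpse", "shot"},
    {"corpse", "found"},
    {"corpse", "shot", "found"},
    {"tanker", "truck"},
    {"tanker", "truck", "explosive"},
]
label_edges = [(0, 2), (1, 2), (3, 4)]
label_clusters = connected_components(len(label_sets), label_edges)
labels = cluster_labels(label_sets, label_clusters)
```

Requiring a word to appear in every report of a cluster is strict. On real data a looser rule, such as words found in most reports of a cluster, may give more useful labels.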

[Image: Enemy action reports]

Above this, the blue "criminal event" cluster merges into the green "enemy action" reports. At the interface we have "civ, killed, shot," which are apparently reports of civilians wounded in battle. Enemy actions also have their own clusters labeled with "mortar," "female," "officer," and "injured." We haven't looked into the "female"/"enemy action" cluster yet; perhaps there is a previously untold story there.

[Image: Explosive hazards]

There is a red cluster off to the side. Red signifies that the military coded these reports as "explosive hazard," and the documents here all include the words "tanker truck." Sure enough, there are contemporaneous press reports of tankers being used as explosive weapons, and this cluster shows that there were at least several dozen such incidents throughout Iraq in December 2006, though it doesn't immediately distinguish between explosions and attempted or threatened explosions.

[Image: Criminal actions detailed]

There's another cluster of blue criminal action reports, labeled "blindfolded, feet, hands." Bound feet and hands were common in sectarian violence at the time, and some reports include the word "torture." There's a nearby cluster of abductions.

[Image: Haditha]

It goes on. December 2006 was a disturbing and complicated time in Iraq, and the visualization has patterns at all scales, especially if you look at the hi-res image and read the tiny single-report labels. There are some dark green "friendly action" reports labeled "convoy," and other "friendly actions" which mention the troublesome town of Haditha (near bottom left). And there is the oil connection, a group of reports which include the word "pipeline."

There is a lot to learn from this image, and much more we could do. To begin with, we'd like to try coloring each dot according to the number of casualties, data which is already in the SIGACTs. We know that more than 4,000 members of the U.S. forces and 100,000 civilians died in Iraq, but what were the circumstances of their deaths? Perhaps we can start to answer that question. We also want to find a way to animate this diagram through time, so we can see how the war changed as it progressed. And of course, we want to get these visualizations into the hands of reporters, U.S. forces, and others who were actually there, so they can help us understand what all of this means.

For more about this visualization, including the technical details, see this article. We see so much potential that The Associated Press, in collaboration with visualization experts and other news organizations, is developing an open-source system for visual exploration of large document sets of many different types. Whether they're leaked, obtained under freedom of information laws, or released as part of government transparency initiatives, large document sets have become a significant part of journalism.

Jonathan Stray is the Interactive Technology Editor of the Associated Press


Comments in chronological order (Total 2 comments)

  • Cellarman

    16 December 2010 11:28AM

    Great graphic display Jonathan. Two questions.

    1) What is happening in the top left of the graphic? It looks a little ordered around one point.

    2) In assessing the frequency of, for example, a word such as dangerous what measures were taken to check whether it was preceded by the word 'not'?

  • crazyjane

    16 December 2010 8:34PM

    We pulled out our data mining textbooks and started experimenting, eventually settling on a technique that extracts the key words from each document. A document's key words appear frequently in that document, but rarely in all others.

    So which technique was it? The above description would fit most text mining.

    The AP open-source visualisation system is very good news.
