Wikileaks Iraq: how to visualise the text

Jonathan Stray shows how AP developed a revolutionary approach to text as data

Wikileaks Iraq: visualising the full text

In October, WikiLeaks released 391,832 reports from the Iraq war, the most comprehensive set of documents about the conflict to date. Each one describes a specific incident in the war. At the rate of one document per minute, it would take 272 days non-stop to read every report, and you still might miss the big picture. This is exactly the sort of problem where visualization can help, by turning patterns in the documents into patterns in a picture.

The Guardian and others had already created visualizations by plotting the incident locations on a map of Iraq, and by graphing monthly casualties. My Associated Press colleague Julian Burgess and I wanted to go a step further, by designing a visualization based on the richest part of the report: the summary text, a human-readable description of what actually happened. But how? We pulled out our data mining textbooks and started experimenting, eventually settling on a technique that extracts the key words from each document. A document's key words appear frequently in that document, but rarely in all others.
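The article does not name the key word technique. One common weighting that fits the description (frequent in this document, rare in the others) is TF-IDF, and the sketch below assumes it. The sample reports, the `tokenize` helper and the `key_words` function are invented for illustration and are not taken from the AP code or the SIGACT data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def key_words(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF words for each document."""
    token_lists = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    results = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Score = share of this document's words * log(rarity across documents).
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_n])
    return results

# Hypothetical report summaries, made up for this example.
docs = [
    "patrol found corpse shot in head corpse moved to station",
    "mortar attack on patrol base mortar rounds landed near gate",
    "tanker truck stopped at checkpoint truck searched by patrol",
]

print(key_words(docs, top_n=1))  # [['corpse'], ['mortar'], ['truck']]
```

Note that "patrol" appears in every sample report, so its score is zero and it never becomes a key word, which is the behaviour the description above calls for.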

Criminal activities where someone died broken down

This is a picture of the 11,616 SIGACT ("significant action") reports from December 2006. Each report is a dot, labeled by its key words. Reports with similar key words have edges drawn between them. The location of the dot has nothing to do with geography. Instead, we ran an algorithm that pulls dots with edges between them closer together. Then we labeled each cluster by the key words that are common to the reports in that cluster, and colored each report/dot by the "incident type," as entered by military personnel. The result is an abstract map of the bloodiest month of the war.
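The two steps described above, drawing edges between similar reports and then pulling connected dots together, can be sketched roughly as follows. The article does not say how similarity was measured or which layout algorithm was used, so this sketch assumes an edge whenever two reports share at least two key words, and a very simple spring-style layout in which every pair of dots repels slightly and connected pairs also attract. The sample key words, the function names and every numeric parameter are assumptions made for illustration.

```python
import math
import random
from itertools import combinations

def build_edges(key_words, min_shared=2):
    """Link reports i < j that share at least min_shared key words."""
    edges = []
    for i, j in combinations(range(len(key_words)), 2):
        if len(set(key_words[i]) & set(key_words[j])) >= min_shared:
            edges.append((i, j))
    return edges

def layout(n, edges, steps=200, pull=0.1, push=0.01, max_step=0.1, seed=0):
    """Place n dots in 2D so that dots joined by an edge end up close together."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    linked = set(edges)  # edges are stored as (i, j) with i < j
    for _ in range(steps):
        moves = [[0.0, 0.0] for _ in range(n)]
        for i, j in combinations(range(n), 2):
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dist = math.hypot(dx, dy) or 1e-9
            # Every pair repels a little; pairs joined by an edge also attract.
            net = -push / dist
            if (i, j) in linked:
                net += pull * dist
            ux, uy = dx / dist, dy / dist
            moves[i][0] += net * ux
            moves[i][1] += net * uy
            moves[j][0] -= net * ux
            moves[j][1] -= net * uy
        for p, (mx, my) in zip(pos, moves):
            # Cap each dot's movement per step to keep the layout stable.
            length = math.hypot(mx, my)
            scale = min(1.0, max_step / length) if length > 0 else 0.0
            p[0] += mx * scale
            p[1] += my * scale
    return pos

def mean_distance(pos, pairs):
    """Average straight-line distance between the given pairs of dots."""
    return sum(
        math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]) for i, j in pairs
    ) / len(pairs)

# Hypothetical key words for six reports, made up for this example.
reports = [
    ["corpse", "shot", "head"],
    ["corpse", "shot", "river"],
    ["corpse", "shot", "blindfolded"],
    ["tanker", "truck", "checkpoint"],
    ["tanker", "truck", "fuel"],
    ["tanker", "truck", "convoy"],
]

edges = build_edges(reports)
pos = layout(len(reports), edges)
unlinked = [p for p in combinations(range(len(reports)), 2) if p not in set(edges)]
print("mean linked distance:  ", mean_distance(pos, edges))
print("mean unlinked distance:", mean_distance(pos, unlinked))
```

In this toy example the "corpse" reports and the "tanker" reports settle into two separate groups, which is why a dot's position reflects shared vocabulary rather than geography. Comparing every pair of reports is fine for six dots but would be slow for 11,616, so a real system would need something more efficient.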

The central cluster is blue, the color for the "criminal event" type, and the documents within it all include the word "corpse." There are a heartbreaking number of them, because this was the height of the Iraqi civil war. Sub-clusters include various modifiers such as "shot."

Enemy action reports

Above this, the blue "criminal event" cluster merges into the green "enemy action" reports. At the interface we have "civ, killed, shot," which are apparently reports of civilians killed in battle. Enemy actions also have their own clusters labeled with "mortar," "female," "officer," and "injured." We haven't looked into the "female"/"enemy action" cluster yet; perhaps there is a previously untold story there.

Explosive hazards

There is a red cluster off to the side. Red signifies that the military coded these reports as "explosive hazard," and the documents here all include the words "tanker truck." Sure enough, there are contemporaneous press reports of tankers being used as explosive weapons, and this cluster shows that there were at least several dozen such incidents throughout Iraq in December 2006, though it doesn't immediately distinguish between explosions and attempted or threatened explosions.

Criminal actions detailed

There's another cluster of blue "criminal event" reports, labeled "blindfolded, feet, hands." Bound feet and hands were common in sectarian violence at the time, and some reports include the word "torture." There's a nearby cluster of abductions.

Haditha

It goes on. December 2006 was a disturbing and complicated time in Iraq, and the visualization has patterns at all scales, especially if you look at the hi-res image and read the tiny single-report labels. There are some dark green "friendly action" reports labeled "convoy," and other "friendly actions" which mention the troublesome town of Haditha (near bottom left). And there is the oil connection, a group of reports which include the word "pipeline."

There is a lot to learn from this image, and much more we could do. To begin with, we'd like to try coloring each dot according to the number of casualties, data which is already in the SIGACTs. We know that over 4,000 U.S. forces and 100,000 civilians died in Iraq, but what were the circumstances of their deaths? Perhaps we can start to answer that question. We also want to find a way to animate this diagram through time, so we can see how the war changed as it progressed. And of course, we want to get these visualizations into the hands of reporters, U.S. forces, and others who were actually there, so they can help us understand what all of this means.

For more about this visualization, including the technical details, see this article. We see so much potential that The Associated Press, in collaboration with visualization experts and other news organizations, is developing an open-source system for visual exploration of large document sets of many different types. Whether they're leaked, obtained under freedom of information laws, or released as part of government transparency initiatives, large document sets have become a significant part of journalism.

Jonathan Stray is the Interactive Technology Editor of the Associated Press

Comments in chronological order (Total 2 comments)

Cellarman
16 December 2010 11:28AM

Great graphic display Jonathan. Two questions.
1) What is happening in the top left of the graphic? It looks a little ordered around one point.
2) In assessing the frequency of, for example, a word such as dangerous what measures were taken to check whether it was preceded by the word 'not'?

crazyjane
16 December 2010 8:34PM

"We pulled out our data mining textbooks and started experimenting, eventually settling on a technique that extracts the key words from each document. A document's key words appear frequently in that document, but rarely in all others."
So which technique was it? The above description would fit most text mining.
The AP open-source visualisation system is very good news.
