This class will cover, in great detail, some of the most advanced techniques used by journalists to understand digital information, and communicate it to users. We will focus on unstructured text information in large quantities, and also cover related topics such as how to draw conclusions from data without fooling yourself, social network analysis, and online security for journalists. These are the algorithms used by search engines and intelligence agencies and everyone in between.
Due to our short schedule — eight classes over three weeks — this will be an intense course. You will be given a homework assignment every class, which should take you 3-6 hours to complete. About half of the assignments will involve some programming in Python. This course will be quite technical — it is, after all, a course about applying computer science to journalism. Aside from being able to program, I assume you know basic computer science theory, and mathematics up to linear algebra. However, the assignments will also require you to explain, in plain English, what the algorithmic result means in journalism terms. The code will not be enough.
Please note that the JMSC is also offering a more accessible data journalism course in May, taught by Irene Jay Liu. You may find that course a better fit if you do not have programming experience. If you are not taking this course for credit you are welcome to sit in on the lectures, but I will not mark your assignments.
You will be assigned readings to study before each lecture. These will typically be research papers. There are also recommended readings that will tell you much more about the topics we cover, and examples of stories that use these techniques.
The course will be graded as follows:
- Assignments: 60%, weighted equally
- Class participation: 10%
- Final project: 30%
Lecture 1: Basics
We’ll try to define computational journalism as the application of computer science to four different areas: data-driven reporting, story presentation, information filtering, and effect tracking. But first we have to figure out how to represent the outside world as data. We do this using the feature vector representation. One of the most useful things we can do with such vectors is compute the distance between two of them. We can also visualize the entire vector space, but to do this we have to project the high-dimensional space down to the two dimensions of the screen.
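As a concrete illustration of what "distance between feature vectors" means, here is a minimal sketch in pure Python. The vectors and vocabulary are made up; real documents would have thousands of dimensions.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 when the vectors point the same way,
    # 1 when they are orthogonal (share nothing)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

# Two documents represented as word-count vectors over the same vocabulary
doc1 = [3, 0, 1, 2]
doc2 = [1, 1, 0, 2]
print(euclidean(doc1, doc2))
print(cosine_distance(doc1, doc2))
```

Cosine distance ignores document length, which is usually what we want when comparing texts of different sizes.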
- Computational Journalism, Cohen, Turner, Hamilton
- sections 1 and 2 of The Challenges of Clustering High Dimensional Data, Steinbach, Ertöz, Kumar
- What should the digital public sphere do?, Jonathan Stray
- Precision Journalism, Ch.1, Journalism and the Scientific Tradition, Philip Meyer
- Using clustering to analyze the voting blocs in the UK House of Lords, Jonathan Stray
- The Jobless Rate for People Like You, New York Times
- Dollars for Docs, ProPublica
- What did private security contractors do in Iraq and document mining methodology, Jonathan Stray
- The network of global corporate control, Vitali et al.
- ‘GOP 5’ make strange bedfellows in budget fight, Chase Davis, California Watch
Lecture 2: Text Analysis
Can we use machines to help us understand text? In this class we will cover basic text analysis techniques, from word counting to topic modeling. The algorithms we will discuss this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.
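To make the vector space model and TF-IDF weighting concrete, here is a small sketch, using the standard tf × log(N/df) formula. The example documents are invented; note how a word that appears in every document ("the") gets weight zero.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: a list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within this document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

docs = [
    "the economy grew last quarter".split(),
    "the election results surprised the economy desk".split(),
    "voters went to the polls".split(),
]
for vec in tf_idf(docs):
    print(vec)
```

These weighted vectors are exactly what the cosine-similarity comparisons in the readings operate on.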
- Online Natural Language Processing Course, Stanford University
- Week 7: Information Retrieval, Term-Document Incidence Matrix
- Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
- Week 7: Ranked Information Retrieval, Term Frequency Weighting
- Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
- Week 7: Ranked Information Retrieval, TF-IDF weighting
- Probabilistic Topic Models, David M. Blei
- General purpose computer-assisted clustering and conceptualization, Justin Grimmer, Gary King
- A full-text visualization of the Iraq war logs, Jonathan Stray
- Introduction to Information Retrieval Chapter 6, Scoring, Term Weighting, and The Vector Space Model, Manning, Raghavan, and Schütze.
Assignment: TF-IDF analysis of State of the Union speeches.
Lecture 3: Algorithmic filtering
This week we begin our study of filtering with some basic ideas about its role in journalism. There’s just way too much information produced every day, more than any one person can read by a factor of millions. We need software to help us deal with this flood. In this lecture, we discuss purely algorithmic approaches to filtering, with a look at how the Newsblaster system works (similar to Google News).
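Newsblaster’s actual pipeline is considerably more sophisticated, but the core move of grouping articles about the same story can be sketched as a greedy single-pass clustering over word-count vectors. The headlines and the similarity threshold below are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse word-count vectors (dicts)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def group_stories(headlines, threshold=0.3):
    # Greedy single-pass clustering: put each article in the first
    # cluster whose seed it resembles, else start a new cluster.
    clusters = []
    for h in headlines:
        vec = Counter(h.lower().split())
        for seed_vec, members in clusters:
            if cosine(vec, seed_vec) >= threshold:
                members.append(h)
                break
        else:
            clusters.append((vec, [h]))
    return [members for _, members in clusters]

stories = [
    "central bank raises interest rates",
    "bank raises rates to fight inflation",
    "local team wins championship final",
]
print(group_stories(stories))
```

Production systems use better representations (TF-IDF, named entities) and better clustering, but the shape of the problem is the same.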
- Who should see what when? Three design principles for personalized news, Jonathan Stray
- Tracking and summarizing news on a daily basis with Columbia Newsblaster, McKeown et al
- Are we stuck in filter bubbles? Here are five potential paths out, Jonathan Stray
- Guess what? Automated news doesn’t quite work, Gabe Rivera
- The Hermeneutics of Screwing Around, or What You Do With a Million Books, Stephen Ramsay
- Can an algorithm be wrong?, Tarleton Gillespie
Lecture 4: Hybrid filters and recommendation systems
It’s possible to build powerful filtering systems by combining software and people, incorporating both algorithmic content analysis and human actions such as follow, share, and like. We’ll look at recommendation systems, the Facebook news feed, and the socially-driven algorithms behind them. We’ll finish by looking at an example of using human preferences to drive machine learning algorithms: Google Web search.
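The item-based collaborative filtering idea from the Sarwar et al. reading can be sketched in a few lines: two items are similar if the same users rated them similarly, and a missing rating is predicted as a similarity-weighted average of the user’s other ratings. The users, items, and ratings below are entirely made up.

```python
import math

# Toy ratings matrix: user -> {item: rating}. All names are hypothetical.
ratings = {
    "ana": {"a": 5, "b": 4, "c": 1},
    "ben": {"a": 4, "b": 5, "c": 2},
    "cho": {"a": 1, "b": 2, "c": 5},
    "dee": {"a": 5, "b": 4},          # hasn't rated item "c" yet
}

def item_vector(item):
    # One rating per user, 0 if the user hasn't rated the item
    return [ratings[u].get(item, 0) for u in ratings]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def item_similarity(i, j):
    # Item-based CF: compare items by the pattern of ratings they received
    return cosine(item_vector(i), item_vector(j))

def predict(user, item):
    # Similarity-weighted average of this user's other ratings
    num = den = 0.0
    for other, r in ratings[user].items():
        s = item_similarity(item, other)
        num += s * r
        den += abs(s)
    return num / den

print(predict("dee", "c"))
```

Real deployments precompute the item-item similarity matrix offline, which is what makes this approach scale better than user-based methods.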
- Finding and Assessing Social Information Sources in the Context of Journalism, Nick Diakopoulos et al.
- Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al.
- How Reddit Ranking Algorithms Work, Amir Salihefendic
- Google News Personalization: Scalable Online Collaborative Filtering, Das et al
- Slashdot Moderation, Rob Malda
- What is Twitter, a Social Network or a News Media?, Haewoon Kwak et al.
- The Netflix Prize, Wikipedia
- How does Google use human raters in web search?, Matt Cutts
Assignment: Design a filtering algorithm for status updates.
Lecture 5: Network analysis
Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We’ll look at basic techniques and algorithms and try to understand the promise — and the many practical problems.
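Two of the simplest centrality measures can be computed in a few lines of Python, which helps demystify what tools like Gephi report. The network below is a toy example with invented names.

```python
from collections import deque

# Toy network of who-talks-to-whom; names are hypothetical
edges = [("ana", "ben"), ("ana", "cho"), ("ana", "dee"), ("ben", "cho"), ("dee", "eli")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def degree_centrality(g):
    # How many direct connections each node has
    return {n: len(nbrs) for n, nbrs in g.items()}

def closeness_centrality(g, node):
    # Inverse of the average shortest-path distance to every other node,
    # computed with breadth-first search
    dist = {node: 0}
    q = deque([node])
    while q:
        cur = q.popleft()
        for nbr in g[cur]:
            if nbr not in dist:
                dist[nbr] = dist[cur] + 1
                q.append(nbr)
    return (len(dist) - 1) / sum(dist.values())

print(degree_centrality(graph))
print({n: round(closeness_centrality(graph, n), 2) for n in graph})
```

Different metrics encode different notions of "importance," which is exactly why the assignment asks you to compare them: the most-connected person is not always the best-positioned one.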
- Analyzing the Data Behind Skin and Bone, ICIJ
- Identifying the Community Power Structure, an old handbook for community development workers about figuring out who is influential by very manual processes
- Centrality and Network Flow, Borgatti
- Visualizing Communities, Jonathan Stray
- The network of global corporate control, Vitali et al.
- The Dynamics of Protest Recruitment through an Online Network, Sandra González-Bailón, et al.
- Sections I and II of Community Detection in Graphs, Fortunato
- Exploring Enron, Jeffrey Heer
- Galleon’s Web, Wall Street Journal
- Who Runs Hong Kong?, South China Morning Post
Assignment: Compare different centrality metrics in Gephi.
Lecture 6: Structured journalism and knowledge representation
Is journalism in the text/video/audio business, or is it in the knowledge business? In this class we’ll look at this question in detail, which gets us deep into the issue of how knowledge is represented in a computer. The traditional relational database model is often inappropriate for journalistic work, so we’re going to concentrate on so-called “linked data” representations. Such representations are widely used and increasingly popular; for example, Google recently released the Knowledge Graph. But generating this kind of data from unstructured text is still very tricky, as we’ll see when we look at the Reverb algorithm.
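The core idea behind linked-data representations such as RDF is that knowledge becomes a set of (subject, predicate, object) triples, which can then be queried by pattern. Here is a minimal sketch; the stored facts are illustrative examples, not a real knowledge base.

```python
# Facts as (subject, predicate, object) triples, the basic unit of
# linked-data representations. Entries are illustrative only.
triples = [
    ("Google", "released", "Knowledge Graph"),
    ("Knowledge Graph", "is_a", "knowledge base"),
    ("Reverb", "extracts", "triples"),
    ("Reverb", "is_a", "open information extraction system"),
]

def query(subject=None, predicate=None, obj=None):
    # Return every triple matching the given pattern; None is a wildcard
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(query(subject="Reverb"))
print(query(predicate="is_a"))
```

Systems like Reverb attempt to produce exactly these triples automatically from free text, which is where the difficulty lies.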
- A fundamental way newspaper websites need to change, Adrian Holovaty
- The next web of open, linked data - Tim Berners-Lee TED talk
- Identifying Relations for Open Information Extraction, Fader, Soderland, and Etzioni (Reverb algorithm)
- Standards-based journalism in a semantic economy, Xark
- What the semantic web can represent - Tim Berners-Lee
- Building Watson: an overview of the DeepQA project
- Can an algorithm write a better story than a reporter?, Wired, 2012
Assignment: Text enrichment experiments using OpenCalais entity extraction.
Lecture 7: Drawing conclusions from data
You’ve loaded up all the data. You’ve run the algorithms. You’ve completed your analysis. But how do you know that you are right? It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis.
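One simple guard against fooling yourself, in the spirit of the graphical inference reading, is a permutation test: shuffle one variable to destroy any real relationship, and ask how often chance alone produces a correlation as strong as the one you observed. The data below is invented for illustration.

```python
import random

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(xs, ys, trials=10000, seed=0):
    # How often does shuffled (relationship-destroyed) data show a
    # correlation at least as strong as the one we observed?
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 7, 5, 8, 6]   # noisy but clearly increasing
print(pearson(xs, ys))
print(permutation_p_value(xs, ys))
```

A small p-value says only that the pattern is unlikely to be chance; it says nothing about causation, which is the subject of the readings below.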
- Correlation and causation, Business Insider
- The Psychology of Intelligence Analysis, chapters 1, 2, 3, and 8, Richards J. Heuer
- If correlation doesn’t imply causation, then what does?, Michael Nielsen
- Graphical Inference for Infovis, Hadley Wickham et al.
- Why most published research findings are false, John P. A. Ioannidis
Assignment: Analyze gun ownership vs. gun violence data.
Lecture 8: Security, Surveillance, and Censorship
Who is watching our online activities? How do you protect a source in the 21st century? Who gets access to all of this mass intelligence, and what does the ability to surveil everything all the time mean, both practically and ethically, for journalism? In this lecture we will talk about who is watching and how, and how to create a security plan using threat modeling.
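At its simplest, threat modeling means enumerating what you are protecting, from whom, and what each adversary can plausibly do. The sketch below shows that structure as data; every entry is a hypothetical example, not security advice.

```python
# A minimal, illustrative threat-model structure: assets, adversaries,
# capabilities, and mitigations. All entries are hypothetical examples.
threat_model = {
    "assets": ["source identity", "unpublished documents", "location data"],
    "adversaries": [
        {"who": "network eavesdropper",
         "capabilities": ["read unencrypted traffic"],
         "mitigations": ["use HTTPS/TLS", "use a VPN or Tor"]},
        {"who": "device thief",
         "capabilities": ["read files on a stolen laptop"],
         "mitigations": ["full-disk encryption", "strong passphrase"]},
    ],
}

def unmitigated(model):
    # Flag adversaries listed without any mitigation: the gaps in the plan
    return [a["who"] for a in model["adversaries"] if not a["mitigations"]]

print(unmitigated(threat_model))
```

The point of writing the model down is the gaps it exposes: an adversary you can name but cannot counter is where your security plan needs work.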
- Chris Soghoian, Why secrets aren’t safe with journalists, New York Times, 2011
- Hearst New Media Lecture 2012, Rebecca MacKinnon
- CPJ journalist security guide section 3, Information Security
- Global Internet Filtering Map, Open Net Initiative
- The NSA is building the country’s biggest spy center, James Bamford, Wired
Assignment: Use threat modeling to come up with a security plan for a given scenario.