Syllabus

This class will cover, in great detail, some of the most advanced techniques used by journalists to understand digital information and communicate it to users. We will focus on unstructured text information in large quantities, and also cover related topics such as how to draw conclusions from data without fooling yourself, social network analysis, and online security for journalists. These are the algorithms used by search engines, intelligence agencies, and everyone in between.

Due to our short schedule — eight classes over three weeks — this will be an intense course. You will be given a homework assignment every class, which should take you 3-6 hours to complete. About half of the assignments will involve some programming in Python. This course will be quite technical — it is, after all, a course about applying computer science to journalism. Aside from being able to program, I assume you know basic computer science theory and mathematics up to linear algebra. However, the assignments will also require you to explain, in plain English, what the algorithmic result means in journalism terms. The code will not be enough.

Please note that the JMSC is also offering a more accessible data journalism course in May, taught by Irene Jay Liu. You may find that course a better fit if you do not have programming experience. If you are not taking this course for credit you are welcome to sit in on the lectures, but I will not mark your assignments.

You will be assigned readings to study before each lecture. These will typically be research papers. There are also recommended readings that will tell you much more about the topics we cover, and examples of stories that use these techniques.

The course will be graded as follows:

  • Assignments: 60%, weighted equally
  • Class participation: 10%
  • Final project: 30%

Lecture 1: Basics
We’ll try to define computational journalism as the application of computer science to four different areas: data-driven reporting, story presentation, information filtering, and effect tracking. But first we have to figure out how to represent the outside world as data. We do this using the feature vector representation. One of the most useful things we can do with such vectors is compute the distance between two of them. We can also visualize the entire vector space, but to do this we have to project the high-dimensional space down to the two dimensions of the screen.
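To make these ideas concrete, here is a minimal sketch in Python (the language we’ll use for assignments). The feature vectors are invented for illustration, and the projection uses PCA, just one of several dimensionality-reduction techniques we could choose:

```python
import numpy as np
from sklearn.decomposition import PCA

# Three "documents" as invented feature vectors, e.g. counts
# of four words in each document.
docs = np.array([
    [3.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 3.0],
    [0.0, 4.0, 3.0, 0.0],
])

# Two common distance measures between the first two vectors.
euclidean = np.linalg.norm(docs[0] - docs[1])
cosine = 1 - docs[0] @ docs[1] / (np.linalg.norm(docs[0]) * np.linalg.norm(docs[1]))
print(f"euclidean: {euclidean:.3f}, cosine distance: {cosine:.3f}")

# Project the 4-dimensional vector space down to the 2 dimensions
# of the screen, one (x, y) point per document.
print(PCA(n_components=2).fit_transform(docs))
```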

Required

Recommended

Examples

Lecture 2: Text Analysis
Can we use machines to help us understand text? In this class we will cover basic text analysis techniques, from word counting to topic modeling. The algorithms we will discuss this week are used in just about everything: search engines, document set visualization, figuring out when two different articles are about the same story, finding trending topics. The vector space document model is fundamental to algorithmic handling of news content, and we will need it to understand how just about every filtering and personalization system works.
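As a preview of the assignment below, here is a minimal sketch of the TF-IDF vector space model using scikit-learn. The toy documents are invented; the Stanford lectures listed under Required define the weighting scheme precisely:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in practice these would be full news articles.
docs = [
    "the council approved the budget",
    "the council rejected the budget proposal",
    "the team won the championship game",
]

# Each document becomes a vector of TF-IDF weights: term frequency
# in the document, discounted by how many documents contain the term
# (inverse document frequency).
tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine similarity in this vector space: the two budget stories
# should score as far more similar to each other than to the third.
print(cosine_similarity(tfidf))
```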

Required

  • Online Natural Language Processing Course, Stanford University
    • Week 7: Information Retrieval, Term-Document Incidence Matrix
    • Week 7: Ranked Information Retrieval, Introducing Ranked Retrieval
    • Week 7: Ranked Information Retrieval, Term Frequency Weighting
    • Week 7: Ranked Information Retrieval, Inverse Document Frequency Weighting
    • Week 7: Ranked Information Retrieval, TF-IDF weighting

Recommended

Examples

Assignment: TF-IDF analysis of State of the Union speeches.

Lecture 3: Algorithmic filtering
This week we begin our study of filtering with some basic ideas about its role in journalism. There’s just way too much information produced every day, more than any one person can read by a factor of millions. We need software to help us deal with this flood. In this lecture, we discuss purely algorithmic approaches to filtering, with a look at how the Newsblaster system works (similar to Google News).
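Newsblaster’s real pipeline is considerably more sophisticated, but its core move, grouping articles about the same event by their similarity in vector space, can be sketched as a single greedy clustering pass. Everything here, including the 0.25 threshold, is an invented simplification:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "storm floods downtown streets",
    "flooding closes downtown after storm",
    "parliament debates new press law",
]

tfidf = TfidfVectorizer().fit_transform(articles)
sim = cosine_similarity(tfidf)

# Greedy single-pass clustering: put each article in the first
# cluster whose seed article is similar enough, else start a new one.
THRESHOLD = 0.25  # arbitrary; real systems tune this carefully
clusters = []     # each cluster is a list of article indices
for i in range(len(articles)):
    for cluster in clusters:
        if sim[i, cluster[0]] >= THRESHOLD:
            cluster.append(i)
            break
    else:
        clusters.append([i])

print(clusters)  # the two storm stories should end up grouped together
```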

Required

Recommended

Lecture 4: Hybrid filters and recommendation systems
It’s possible to build powerful filtering systems by combining software and people, incorporating both algorithmic content analysis and human actions such as follow, share, and like. We’ll look at recommendation systems, the Facebook news feed, and the socially driven algorithms behind them. We’ll finish by looking at an example of using human preferences to drive machine learning algorithms: Google Web search.
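As a taste of how recommendation works under the hood, here is a minimal sketch of user-based collaborative filtering. The like matrix is invented, and real systems add far more machinery (implicit signals, matrix factorization, freshness), but the similarity-weighted vote is the core idea:

```python
import numpy as np

# Rows = users, columns = stories; 1 = liked, 0 = no signal.
# All values are invented for illustration.
likes = np.array([
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict user 0's interest in story 2 as a vote of the other
# users, weighted by how similar their tastes are to user 0's.
user, story = 0, 2
others = [u for u in range(len(likes)) if u != user]
weights = np.array([cosine(likes[user], likes[u]) for u in others])
votes = likes[others, story]
print(f"predicted interest: {weights @ votes / weights.sum():.2f}")
```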

Required

Recommended

Assignment: Design a filtering algorithm for status updates.

Lecture 5: Network analysis
Network analysis (aka social network analysis, link analysis) is a promising and popular technique for uncovering relationships between diverse individuals and organizations. It is widely used in intelligence and law enforcement, but not so much in journalism. We’ll look at basic techniques and algorithms and try to understand the promise — and the many practical problems.
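The assignment asks you to compare centrality metrics; as a preview, here is a sketch using the networkx library on a made-up graph (Gephi computes the same metrics through its interface). Note that two standard metrics can disagree about who the key player is:

```python
import networkx as nx

# A made-up network: a well-connected cluster around "ana",
# linked to a second cluster only through "dmitri".
G = nx.Graph()
G.add_edges_from([
    ("ana", "bo"), ("ana", "carol"), ("ana", "dmitri"), ("bo", "carol"),
    ("dmitri", "eve"),                      # the only bridge
    ("eve", "fay"), ("eve", "gil"), ("fay", "gil"),
])

# Degree centrality rewards direct connections; betweenness rewards
# sitting on the shortest paths between others. They pick different
# "most central" people in this graph.
for name, metric in [("degree", nx.degree_centrality(G)),
                     ("betweenness", nx.betweenness_centrality(G))]:
    top = max(metric, key=metric.get)
    print(f"most central by {name}: {top} ({metric[top]:.2f})")
```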

Required

Recommended

Examples

Assignment: Compare different centrality metrics in Gephi.

Lecture 6: Structured journalism and knowledge representation
Is journalism in the text/video/audio business, or is it in the knowledge business? In this class we’ll look at this question in detail, which gets us deep into the issue of how knowledge is represented in a computer. The traditional relational database model is often inappropriate for journalistic work, so we’re going to concentrate on so-called “linked data” representations. Such representations are widely used and increasingly popular. For example, Google recently released the Knowledge Graph. But generating this kind of data from unstructured text is still very tricky, as we’ll see when we look at the Reverb algorithm.
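The essence of linked data is the (subject, predicate, object) triple. Here is a toy sketch in plain Python to show the representation and how it can be queried; real systems use RDF triple stores and standardized vocabularies, and these particular facts are just for illustration:

```python
# Knowledge as (subject, predicate, object) triples, the core idea
# behind RDF, linked data, and Google's Knowledge Graph.
triples = {
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "field", "mathematics"),
    ("London", "capital_of", "United Kingdom"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the non-None parts of the pattern."""
    return [(s, p, o) for (s, p, o) in triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)]

print(query(subject="Ada Lovelace"))             # everything we know about her
print(query(predicate="born_in", obj="London"))  # who was born in London?
```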

Required

Recommended

Assignment: Text enrichment experiments using OpenCalais entity extraction.

Lecture 7: Drawing conclusions from data
You’ve loaded up all the data. You’ve run the algorithms. You’ve completed your analysis. But how do you know that you are right? It’s incredibly easy to fool yourself, but fortunately, there is a long history of fields grappling with the problem of determining truth in the face of uncertainty, from statistics to intelligence analysis.
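One simple guard against fooling yourself is the permutation test: check how often pure chance, in the form of shuffled data, produces a pattern at least as strong as the one you found. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: two variables measured across 20 "regions",
# with a weak real relationship buried in noise.
x = rng.normal(size=20)
y = 0.3 * x + rng.normal(size=20)

observed = np.corrcoef(x, y)[0, 1]

# Shuffle y 10,000 times, destroying any real relationship, and see
# how often chance alone matches the observed correlation.
shuffled = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                     for _ in range(10_000)])
p_value = np.mean(np.abs(shuffled) >= abs(observed))
print(f"observed r = {observed:.2f}, permutation p = {p_value:.3f}")
```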

Required

Recommended

Assignment: Analyze gun ownership vs. gun violence data.

Lecture 8: Security, Surveillance, and Censorship
Who is watching our online activities? How do you protect a source in the 21st century? Who gets access to all of this mass intelligence, and what does the ability to surveil everything all the time mean, both practically and ethically, for journalism? In this lecture we will talk about who is watching and how, and how to create a security plan using threat modeling.

Required

Recommended

Cryptographic security

Anonymity

Assignment: Use threat modeling to come up with a security plan for a given scenario.
