The Rest of the Story
Matthew L. Jockers

My blog post of February 2 about the Syuzhet package I developed for R (now available on CRAN) generated some nice press that I was not expecting: Motherboard, then The Paris Review, and several R blogs (Revolutions, R-Bloggers, inside-R) all featured the work. The press was nice, but I was not at all prepared for the focus to be placed on the one piece of the story that I had yet to explain, namely, how I used the Syuzhet code and some unsupervised machine clustering to identify what seem to be six, or possibly seven, archetypal plot shapes. So, here now is the rest of the story. . .

In brief: (A Plot Modeling Recipe)

1. Apply functions available in the Syuzhet package to generate a generalized plot shape for every book in a corpus of 41,383 novels.[1]
2. Employ Euclidean distance to build a large distance matrix by computing the distance between every pair of novels.
3. Use unsupervised hierarchical clustering to group books based on the similarity of their plot shape.
4. Examine the resulting clusters with furrowed brow and say “hmmmm.”
5. Test several methods of cluster identification (silhouette, gap statistic, elbow).
6. Develop an ad hoc cluster identification algorithm.
7. Observe that there are six, or maybe seven, fundamental plot shapes.
8. Repeat everything over and over again for 12 months while worrying a lot about observing six or seven plots.

Caveats:

Before I reveal the six/seven plots (scroll down if you can’t wait), it’s important to point out that what I offer here is the result of two particular methods of analysis. If you don’t like the plot shapes that these methods reveal, then you’ll be free to take issue with the methods and try a different approach. You could, for example,

1. Read 41,383 novels and sketch the plots of each using Vonnegut’s chalkboard. You could then spend a few decades organizing and classifying them into some sort of taxonomy. You could then work on clustering them into a finite set of foundational shapes. This is more or less the method Vonnegut employed, excepting, of course, that he probably only read a few hundred stories and probably only sketched out a few dozen on his chalkboard.
2. You could use another method, such as the one that Benjamin Schmidt has proposed over at his Sapping Attention blog.

Background:

In my previous post, I explained how I developed some software (named “Syuzhet” in homage to Propp) to extract plot shapes from novels based on sentiment analysis. In order to understand how I derive the six/seven plot archetypes, we need to understand a little bit about Euclidean distance and hierarchical clustering. The former provides a mathematical way of computing the similarity or distance between two points in space. When that space is two-dimensional, it’s pretty easy to visualize what is going on: we plot two points on an x-y grid and then measure the distance between them. When the space is three-dimensional, it gets a bit harder, but you can still imagine measuring the distance between some point about three feet off the floor in your kitchen and some point about five feet off the floor in your living room. Once we go beyond the third dimension, things get downright tricky, and we have to rely on the mathematics of the Euclidean metric. Regardless of the dimensions, though, the fundamental idea is the same: we are measuring the distance between points, and the shorter that distance, the more similar the points are. In this case the points are books, and the feature that determines their point in space is their “plot shape” as derived from Syuzhet.
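To make the idea of distance in many dimensions concrete, here is a minimal sketch in base R. The two "books" are made-up curves of 100 values each (not real novels), and the helper name `euclid` is purely illustrative.

```r
# Two made-up "plot shapes", each a vector of 100 values (not real novels).
x <- seq(0, 1, length.out = 100)
book_a <- sin(2 * pi * x)
book_b <- cos(2 * pi * x)

# Euclidean distance: the square root of the summed squared differences,
# taken across every dimension (here, all 100 positions in the shape).
euclid <- function(p, q) sqrt(sum((p - q)^2))

d_ab <- euclid(book_a, book_b)

# Base R's dist() treats each row as a point and returns the same value.
d_check <- as.numeric(dist(rbind(book_a, book_b)))
```

The same function works in two dimensions, where it reduces to the familiar ruler measurement on an x-y grid: `euclid(c(0, 0), c(3, 4))` is the hypotenuse of a 3-4-5 triangle.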

Once the distances between all the points are measured, we construct a “distance matrix.” This distance matrix is just a big spreadsheet where we can look up the distance from any one point to any other point. It might look something like Figure 1. According to this matrix, the distance between Book 1 and Book 3 is “0.5” whereas the distance between Book 2 and Book 3 is “0.25.”

Figure 1: A Distance Matrix
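A distance matrix like the one in Figure 1 can be sketched with base R's `dist()`. The four "books" below are random toy data, so the distances will not match the figure; the row labels are illustrative only.

```r
set.seed(1)

# Four made-up books, one row per book, 100 values per row (toy data).
shapes <- matrix(rnorm(4 * 100), nrow = 4,
                 dimnames = list(paste("Book", 1:4), NULL))

# dist() computes the Euclidean distance between every pair of rows.
d <- dist(shapes, method = "euclidean")

# Convert to a full square matrix so any pair can be looked up by name.
m <- as.matrix(d)
book1_book3 <- m["Book 1", "Book 3"]

# The number of distinct pairs grows quickly with the size of the corpus.
n_pairs <- choose(41383, 2)
```

The matrix is symmetric with zeros on the diagonal, since every book is at distance zero from itself. For a corpus of 41,383 books, `n_pairs` works out to roughly 856 million pairwise distances, which is why the matrix is "large."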

Hierarchical clustering methods use this distance matrix as a foundation upon which to build a hierarchy of similarities. This hierarchy is often visualized as a dendrogram such as seen in Figure 2.

Figure 2: Dendrogram
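Building a hierarchy from a distance matrix can be sketched with base R's `hclust()`. The data are made up, and the linkage method (`"ward.D2"`) is an assumption chosen for illustration; the post does not say which linkage was used.

```r
set.seed(42)

# Ten made-up books in two loose groups (toy data, not real novels).
shapes <- rbind(
  matrix(rnorm(5 * 100, mean = -1), nrow = 5),
  matrix(rnorm(5 * 100, mean =  1), nrow = 5)
)
rownames(shapes) <- paste("Book", 1:10)

# Agglomerative clustering: start with every book as its own cluster and
# repeatedly merge the two closest clusters until only one remains.
# The linkage method here is an assumption, not taken from the post.
tree <- hclust(dist(shapes), method = "ward.D2")

# tree$height records the height of each merge; these heights are the
# vertical axis of the dendrogram.
merge_heights <- tree$height

# Draw the dendrogram when running interactively.
if (interactive()) plot(tree)
```

With ten books there are nine merges, and each merge corresponds to one junction in the drawn tree.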

Figure 2 is a bit like a tree (upside down); it has branches. At any vertical point, we can cut this tree and the result would be to separate it into two or more branches, or clusters. For example, cutting the tree in Figure 2 at a height of 225 would result in four primary clusters. The trick with this sort of tree cutting is identifying an “ideal” vertical position to insert the saw. Before I get to that, though, we need to step back for a moment to those plots created with the Syuzhet software.
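Tree cutting can be sketched with base R's `cutree()`, which accepts either a desired number of clusters (`k`) or a cut height (`h`). The data and the linkage method below are assumptions for illustration, and the heights will not match those in Figure 2.

```r
set.seed(7)

# Eighteen made-up books in three well-separated groups (toy data).
shapes <- rbind(
  matrix(rnorm(6 * 100, mean = -2), nrow = 6),
  matrix(rnorm(6 * 100, mean =  0), nrow = 6),
  matrix(rnorm(6 * 100, mean =  2), nrow = 6)
)
tree <- hclust(dist(shapes), method = "ward.D2")  # linkage is an assumption

# Option 1: ask for a number of clusters directly.
by_count <- cutree(tree, k = 3)

# Option 2: place the saw at a height. Any height between the last merge
# that leaves three clusters and the merge that reduces them to two gives
# the same three clusters.
n <- length(tree$height)
h_cut <- mean(tree$height[c(n - 2, n - 1)])
by_height <- cutree(tree, h = h_cut)
```

Both calls return one cluster label per book. Moving `h_cut` up or down the tree changes how many merges fall below the cut, and therefore how many clusters result.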

The Plot Thickens

In my previous post, I showed what the plots of Joyce’s Portrait and Wilde’s Dorian Gray look like when graphed using Syuzhet. Underneath each plot graph is a sequence of 100 numbers from which the shape of the plot is derived. I have collected these sequences for 41,383 novels, and when I average them, I get the “super average plot archetype” seen in Figure 3.

Figure 3: The Super Average Plot

That is kind of interesting, but things get a lot more interesting after a bit of tree cutting. If you look at the dendrogram in Figure 2 again, you see that cutting the tree just below 250 will result in two primary clusters. After cutting the tree at that point, it is then possible to calculate a mean shape for all the books in each cluster. The result is seen in Figure 4.

Figure 4: Two Primary Plots
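Averaging the shapes within each cluster can be sketched as follows. The sixteen "books" are made-up noisy curves, half dipping and half rising in the middle, and the linkage method is an assumption; nothing here reproduces the actual corpus or its 46%/54% split.

```r
set.seed(3)
x <- seq(0, 1, length.out = 100)

# Two made-up underlying shapes: one dips in the middle, one rises.
hole <- -sin(pi * x)
hill <-  sin(pi * x)

# Sixteen toy books: eight noisy copies of each shape, one book per row.
shapes <- rbind(
  t(replicate(8, hole + rnorm(100, sd = 0.2))),
  t(replicate(8, hill + rnorm(100, sd = 0.2)))
)

# The average of all books, analogous to a single overall mean shape.
overall_mean <- colMeans(shapes)

# Cut the tree into two clusters (linkage method is an assumption).
tree <- hclust(dist(shapes), method = "ward.D2")
cluster_id <- cutree(tree, k = 2)

# Mean shape per cluster: average each of the 100 positions within a cluster.
mean_shapes <- t(sapply(split(seq_len(nrow(shapes)), cluster_id),
                        function(rows) colMeans(shapes[rows, , drop = FALSE])))

# Share of books that falls into each cluster.
shares <- prop.table(table(cluster_id))
```

Each row of `mean_shapes` is one averaged curve of 100 values, which is what would be drawn as a single panel in a figure like Figure 4.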

In homage to Vonnegut, I have titled the shape on the left “man in hole.” 46% of the books in this corpus fall into this cluster. The remaining 54% are more similar to the plot on the right, which I have named “man on hill.” At this point, I’d encourage you to take a quick peek at Maya Eilam’s very nice visualization of Vonnegut’s archetypal plot shapes. The plots I’ll show here are not going to look quite the same, but there will be some resonance.

Looking again at the dendrogram, you can see that the two primary clusters (MOH and MIH) can be split fairly easily into a set of four clusters. When the tree is cut in this manner, the two plots shown in Figure 4 split into four.

Figure 5: MIH Types I and II

Figure 5 shows the derivatives of the man in hole plot shape. The man in hole plot splits into one shape (“Type I”) that looks a lot like classical tragedy and another (“Type II”) that looks more like comedy. Whatever the case, one has a much happier ending than the other. Figure 6 shows the derivatives of the man on hill.

Figure 6: Man on Hill Types I and II

Here again, one plot leads us to a happy ending and the other to a rather dark conclusion.

Cutting the tree beyond these four shapes gets trickier. It is difficult to know where precisely to stop and cut. Move the cut point just a little bit, and we could go from having 10 clusters to 20; it is possible, in fact, to keep moving the cut point further and further down the tree until a point at which every book is its own cluster! Doing that, however, would be rather silly (see “Caveats” item 1 above). So the objective is to find an “ideal” place to cut the tree such that the resulting clusters have the greatest amount of internal homogeneity while simultaneously being as different from each other as possible.

My solution to this problem involves iterating through a series of possible cut points and then taking two measures after each cutting. The first is a measure of cluster homogeneity; the second is a measure of cluster dissimilarity. This process is more easily described in pseudocode:

Let K be a number of possible clusters from 2 to 50.

for(K in 2:50){
- cut the tree such that there are K clusters
- calculate the amount of in-cluster homogeneity
- calculate the dissimilarity between the K clusters
}

With each iteration, I store the resulting values so that I can compare them and identify a value of K that best fulfills the objectives described above. In order to make this test more robust, I opted to randomly select a subset of one half of the books in the corpus (roughly 20K) and run this test over and over again (each time with a new random sample). When I did this, I found that the method identified six as the ideal number of clusters about 90% of the time. The other 10% of the time, it said that seven or eight was a better choice.[2]
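The loop and the repeated half-sampling can be sketched in base R. The post does not specify its two measures or how they are combined, so this sketch swaps in a standard stand-in, the Calinski-Harabasz index, which rewards clusters that are tight internally and far apart from one another. The toy corpus, the linkage method, the range of K, the number of repetitions, and the helper names (`within_ss`, `ch_index`, `best_k`) are all assumptions for illustration.

```r
set.seed(11)
x <- seq(0, 1, length.out = 100)

# Made-up corpus: 120 toy books built from three underlying shapes plus noise.
base_shapes <- rbind(sin(pi * x), -sin(pi * x), sin(2 * pi * x))
shapes <- base_shapes[rep(1:3, each = 40), ] +
  matrix(rnorm(120 * 100, sd = 0.3), nrow = 120)

# Homogeneity stand-in: within-cluster sum of squares (lower is tighter).
within_ss <- function(m, id) {
  sum(sapply(split(seq_len(nrow(m)), id), function(rows) {
    block <- m[rows, , drop = FALSE]
    sum(sweep(block, 2, colMeans(block))^2)
  }))
}

# Calinski-Harabasz index: between-cluster spread relative to
# within-cluster spread, each scaled by its degrees of freedom.
ch_index <- function(m, id) {
  n <- nrow(m)
  k <- length(unique(id))
  total <- sum(sweep(m, 2, colMeans(m))^2)
  w <- within_ss(m, id)
  ((total - w) / (k - 1)) / (w / (n - k))
}

# For one sample: build the tree, cut it at each K, keep the best-scoring K.
best_k <- function(m, ks = 2:10) {
  tree <- hclust(dist(m), method = "ward.D2")  # linkage is an assumption
  scores <- sapply(ks, function(k) ch_index(m, cutree(tree, k = k)))
  ks[which.max(scores)]
}

# Repeat on random halves of the corpus and tally which K wins each time.
picks <- replicate(20, {
  rows <- sample(nrow(shapes), size = nrow(shapes) %/% 2)
  best_k(shapes[rows, ])
})
tally <- table(picks)
```

The `tally` table plays the role of the percentages reported above: it counts how often each value of K was chosen across the random halves. Because the toy corpus is built from three shapes, the result says nothing about the real corpus.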

In addition to this mathematical approach, I also employed good old subjective evaluation. The tool suggested six or seven, but this number (six, seven) would be rather useless if the resulting shapes did not make any sense to those of us who actually read the books. So, I looked at a lot of plots; everything from two to twenty. After twenty, I figure there is not much point because the shapes get so similar to each other that it would be rather hard to make the case that plot 19 is really all that different from plot 20. With six and with seven, however, there remains a good deal of variation.

We saw above how MIH and MOH both split into subtypes. These I labeled as MIH Type I, MIH Type II, MOH Type I, and MOH Type II. At the cut point that results in six plots, MIH Type I and MOH Type II stay as we saw them above in Figures 5 and 6, but MIH II and MOH I both split, resulting in the shapes seen in Figure 7.

Figure 7: Level Six

Already we can begin to see some shape repetition. The variant of MIH seen in the lower right is ultimately a steeper, or more extreme, version of the basic MIH. The other three, though, appear rather more distinct.

At level seven, MOH II splits in two, resulting in the shapes shown in Figure 8. After seven, we begin to see a lot more shape repetition, and though each of these shapes is unique in terms of its precise placement on the y axis (i.e., some are happier, others darker), the arcs are generally similar.

Obviously, there is a great deal more interpretive work to be done here. Many of these shapes, I think, can be further classified according to their “affects” and “effects.” What, for example, is the overall impression one gets from a book that takes a character to great heights (MOH) and then plunges him/her into a pit of despair from which there is no exit (as is seen in Figure 8 left)?

Figure 8: Seven Plots

But perhaps even more interesting than any of this is the possibility for movement between scales. Scale hopping is something I advocate in Macroanalysis. The great power of big(ish) data is that it allows us to contextualize our small reading. Joyce’s Portrait of the Artist (Figure 9) is a type of MIH. What other books are MIHs? Are they popular books? Are they classics? Best sellers? Can we find another telling of the same story? This is the work that I am doing now, moving from the large to the small and back again. Figures 10-15 (below) present six popular/well-known novels and their corresponding plot types for consideration.

Figure 9: Joyce’s Portrait

Figure 10

Figure 11

Figure 12

Figure 13

Figure 14

Figure 15

Footnotes:

[1] The Syuzhet package performs a certain type of text analysis, and I’m claiming that the results of this analysis may serve as a pretty darn good proxy for plot. That said, I’ve been working on this problem for two years, and I know some specific places where it fails. The most spectacular example of failure was discovered by my son. He’d just finished reading one of the books in my corpus, and I showed him the plot shape from the book and asked him if it made sense. He said, “well, yes, mostly. But this spike here is all wrong.” It was a spike in good fortune, positive valence, at precisely the place in the novel where the villains had scored a major victory. The positive valence was associated with a several-page-long section in which the bad guys were having a very good time. Readers, of course, would see this as a negative moment in the text; Syuzhet does not. Nor does Syuzhet understand irony and dark humor and so on. On the whole, however, Syuzhet gets it right, and that’s because most books are not sustained satire, or sustained irony. Most books end up using emotional markers in a fairly consistent and conventional way. Indeed, even for an experimental novel such as Joyce’s Ulysses, Syuzhet produces a plot shape that I consider to be a good match to the ebbs and flows of the text.

[2] In a longer, less blog friendly version of this research that is to appear in a collection of essays on digital literary studies, I explain the mathematics in precise detail.