This page shows the 500 topics used by Matt Jockers in his book Macroanalysis, which he extracted automatically from ~3000 mostly 19th-century English-language novels. A topic-modeling algorithm divided the content of the novels into themes based on word co-occurrence patterns within 1000-word chunks. The 200 most probable words for each topic are shown on the right, along with a short, manually generated descriptive label.
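As a rough illustration of the preprocessing described above, here is a minimal sketch that cuts a text into 1000-word chunks and picks the most probable words from one topic's word distribution. The whitespace tokenization, the handling of the final short chunk, and the names `chunk_words`, `top_words`, and `word_probs` are assumptions made for illustration; the page does not say how these details were handled for the book, and the topic model itself is not shown.

```python
def chunk_words(text, size=1000):
    """Split text into consecutive chunks of `size` whitespace-separated words.

    Assumption: the last chunk is kept even if it is shorter than `size`.
    """
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def top_words(word_probs, n=200):
    """Return the n most probable words of one topic, most probable first.

    `word_probs` is a hypothetical mapping from word to probability.
    """
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]
```

In a real pipeline the chunks would be passed to a topic model, which would supply the per-topic word probabilities that `top_words` ranks.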

The plot on the left shows how each topic is used within novels. Each 1000-word chunk is assigned to one of 20 equal-sized sections. (Long books will have 20 long sections; shorter books will have 20 shorter sections.) A topic that occurs more in the early sections of novels will have a falling line, like school, while a topic that occurs more in later sections will have a rising line, like punishment. The topics are sorted so that topics that occur early in novel time are at the top of the page, and topics that occur late are at the bottom. The gray shaded region represents one standard deviation on either side of the mean; there's a lot of variability between novels. The topics at the very beginning and very end are really about library and publishing practices, but go past these a little and you'll see some suggestive patterns: education and facial descriptions tend to appear early, while trials and murder appear late.
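The sectioning and summary statistics described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the page: the function names, the use of a population standard deviation, and the toy data are all assumptions. The page also does not say how topics are ranked from early to late, so the weighted-average section index (`center_of_mass`) is a hypothetical sort key.

```python
from statistics import mean, pstdev

N_SECTIONS = 20


def section_index(chunk_pos, n_chunks, n_sections=N_SECTIONS):
    """Map the 0-based position of a chunk within a novel to one of n_sections."""
    return chunk_pos * n_sections // n_chunks


def novel_profile(chunk_weights, n_sections=N_SECTIONS):
    """Average one topic's per-chunk weight within each section of one novel.

    Novels with fewer chunks than sections leave some sections empty (None).
    """
    buckets = [[] for _ in range(n_sections)]
    for pos, weight in enumerate(chunk_weights):
        buckets[section_index(pos, len(chunk_weights), n_sections)].append(weight)
    return [mean(b) if b else None for b in buckets]


def topic_curve(novels, n_sections=N_SECTIONS):
    """Per section, the (mean, standard deviation) of the topic across novels."""
    profiles = [novel_profile(weights, n_sections) for weights in novels]
    curve = []
    for s in range(n_sections):
        values = [p[s] for p in profiles if p[s] is not None]
        curve.append((mean(values), pstdev(values)) if values else (None, None))
    return curve


def center_of_mass(curve):
    """Hypothetical sort key: the weighted average section index of a topic."""
    points = [(s, m) for s, (m, _) in enumerate(curve) if m is not None]
    total = sum(m for _, m in points)
    if total == 0:
        return (len(curve) - 1) / 2  # no signal: place the topic in the middle
    return sum(s * m for s, m in points) / total


# Toy data (invented): per-chunk weights of two topics in two 40-chunk novels.
toy = {
    "late_topic": [[0.0] * 20 + [1.0] * 20, [0.0] * 20 + [0.5] * 20],
    "early_topic": [[1.0] * 20 + [0.0] * 20, [0.5] * 20 + [0.0] * 20],
}
curves = {name: topic_curve(novels) for name, novels in toy.items()}
order = sorted(curves, key=lambda name: center_of_mass(curves[name]))
print(order)  # topics that peak early come first
```

The mean in each `(mean, standard deviation)` pair corresponds to the plotted line, and the standard deviation gives the half-width of the shaded band around it.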

Raw data: JSON

Where do themes occur in novels?