TEXT MINING FOR HISTORY & LITERATURE

This is an archived version of this class. The class will be taught in Fall 2020 by Prof. Wilkens.

INFO 3350 / INFO 6350

Fall 2019. Upson 142. MWF 11:15-12:05. Additional graduate discussion section for 6350 F 12:20-1:10.

Readings and Assignments, CMS for assignments, Syllabus (course policies), Campuswire site for questions

Instructor: David Mimno (mimno at cornell dot edu).

TAs: Maria Antoniak

Goals. The course will introduce students to research methods in computer-assisted scholarship. We will learn to represent text documents in computational forms, and to appreciate the effect of choices we make in this process. We will cover a selection of popular tools such as classification, clustering, and topic modeling. Each week we will discuss both the details of computational methods and how each method can be applied in the context of scholarly research.

Should I take this class? The course is designed for students with a serious commitment to learning about the past through documents, and an interest in applying computational approaches. Although the course will use programming and data analysis, it is not primarily a programming or data analysis course. All required programming and statistics will be introduced in class. The course is targeted at those with little to no computational background; previous Python experience (such as CS 1110) is useful but not required.

Format. The class will be flipped. New material will be introduced through talks and readings from a variety of online sources. Class time will focus on hands-on experimentation in pairs, with instructor assistance and discussion. We will use two types of readings to introduce material for this course: technical readings and discussion readings.

We anticipate that many of these readings will present more material than is easy to digest the first time around. The goal is to introduce the tools and techniques of text mining, not to commit to memory how each is implemented or the mathematics behind it. We encourage you to make a good effort to understand the material and to bring questions to class about what doesn't make sense. We will also have a graduate discussion section to dive further into the current topic with respect to individual fields and specific research issues in student work.

Office hours. Office hours will be listed on Piazza.

Schedule. The schedule can be found here. This schedule is tentative and may change as the semester progresses. Both the 3350 final exam and the 6350 final project will be due on the final exam date announced by the registrar.

Evaluation. Completed in-class assignment work will be due on the Monday of the following week through CMS. Reading responses will be due before Friday discussions. There will be a take-home final for students in the undergraduate section. For the graduate section there will be an independent research project applying methods learned in class to a document corpus of your choosing, written up in a report of approximately 10-20 pages.

Materials. There is no required textbook, although Matt Jockers' Text Analysis with R for Students of Literature is recommended. All other materials will be online.

Programming resources. To permit pair programming, we ask that students with laptops bring them to class. We will be programming in Python 3. To ensure that everyone has the same tools and packages available, you can install Anaconda, after which you should not need to install anything else for the rest of the semester.

Is this a Digital Humanities course? The content of this course falls in the broad category of "DH", but this course is focused on a specific area: computational text analysis. There are many other interesting directions that could also be considered DH. We will talk a lot about the role of computation, and specific computational tools, in scholarship, but we are unlikely to talk a lot about "what is/are Digital Humanities?"

Resources.