INFO 2950: Intro to Data Science

David Mimno, Tuesday/Thursday 1:25-2:40, Spring 2019, Hollister B14


For questions use Campuswire, a new Piazza alternative. Use the code 4212 to join the class. All assignments and submissions will happen through CMS.

Goal: Students will be able to select and apply common statistical methods to real-world settings, and assess the validity of results. Topics will include combinatorics, data-intensive programming, probability, regression, matrix factorization, Markov chains, and graph algorithms.


Probability. Being certain about uncertainty. Estimating probabilities by counting and from data. Turning probabilities into other probabilities. Bayes' Theorem. Simulating random processes. File and string manipulation. Regular expressions. Word counting. Naive Bayes classifiers.

  • January 22: Course introduction. Finding lead in Flint. What is data science in general, and InfoSci data science in particular? Slides. Data collected in class.
  • January 24: Probability. Joint and conditional. Independent variables. Notes.
  • January 29: Probability. Joint and conditional. Bayes rule. Normalization. Notes.
  • January 31: Expectation. Expectations of functions. Mean and variance. Notes and Notebook on how to choose iteration structures.
  • February 5: A real example of data collection and analysis. Variable types and conversion. Bar plots. First peek at Poisson distributions. Data and Code.
  • February 7: Regular expressions. Text as Data. Smoothed probability estimates. Notes.
  • February 12: Distributions, expectations and variance. Binomial distribution.
  • February 14: More distributions, geometric, Poisson, normal. Code.
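
As an illustration of how several topics in this unit fit together (regular expressions, word counting, and smoothed probability estimates), here is a minimal sketch in Python. It is not taken from the course notes: the function name `smoothed_word_probs`, the tokenizing pattern, the toy sentence, and the choice of additive smoothing with `alpha=1.0` are all assumptions made for the example.

```python
import re
from collections import Counter

def smoothed_word_probs(text, vocabulary, alpha=1.0):
    """Estimate P(word) over a fixed vocabulary with additive smoothing.

    Hypothetical helper for illustration; alpha is the pseudo-count
    added to every vocabulary word (alpha=1.0 is add-one smoothing).
    """
    # Lowercase the text and extract runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count only the tokens that belong to the vocabulary.
    counts = Counter(t for t in tokens if t in vocabulary)
    # The denominator includes one pseudo-count of alpha per vocabulary word,
    # so the estimates still sum to 1.
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

# Toy data (made up): "copper" never appears in the text.
vocab = {"lead", "water", "pipe", "copper"}
probs = smoothed_word_probs("Lead in the water. Lead in the pipe?", vocab)
```

Without smoothing, "copper" would get probability zero because it never occurs in the text; with the pseudo-count it gets a small positive estimate, and the probabilities over the vocabulary still sum to one.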

Statistical modeling. Combining information. Making predictions. Measuring confidence. Gaussian distributions. Lots of other distributions. Variable transformations. Single-variable regression. Hypothesis testing. Multi-variable regression. Poisson and logistic regression.

Clustering and representation learning. Finding patterns. Distance functions. Iterative algorithms. k-Means clustering. Vectors and inner/outer products. Matrix factorization. SVD, NMF. Eigenvectors and random walk algorithms.

Course policies

Grading will be 50% problem sets, 25% final exam/project, 20% midterm exam, and 5% professionalism (attendance, participation, etc.).

Students with disabilities: We will make every effort possible to ensure that the class works for all students. Contact Prof. Mimno if there is anything we should know about. If there is a specific event such as an exam that you are concerned about, please inform us at least two weeks in advance so that we have time to make arrangements.

SONA credits: Many researchers on campus need participants for user studies and other types of experiments. The SONA system allows you to register for such studies. You will get 0.5 percent extra credit in this course for each 30-minute study (or equivalent), up to a maximum of 2.0 percent. Participating in studies is a great way to find out what real research looks like. To register, go to the SONA system.

Attendance: You have to come to class. Attendance will be taken at every Friday session and, at random intervals, at Tuesday/Thursday lectures.

Slip days: Each student can take up to five "slip days" for homework with no penalty. These are designed to accommodate ordinary stresses and mishaps: the three prelims that you have at the same time, the interview in California (good luck!), the cold that knocks you out for a day. If you turn in the wrong file or an unreadable file, we will notify you within 12 hours; resubmitting the correct file will cost one slip day. For out-of-the-ordinary events (serious injuries or health problems, family emergencies, etc.), write to Prof. Mimno or a TA. It's ok to not be ok! But we can only help you if you tell us that something isn't right while we still have time to do something about it.

Integrity: We will follow university policies as outlined in the Academic Integrity Handbook. You should not take credit for work you did not do. You should not tolerate any other student doing so. You may discuss homework problems, but you have to write your own answers by yourself. You may consult online forums or look at examples, but you cannot copy text or code from them. You are not helping your friend by allowing them to not learn. You do not deserve a good grade because you are smart: there is no such thing as an "A" student, only "A" work. Remember that the focus of this class is identifying statistically unlikely patterns in data.