INFO 2950: Intro to Data Science

David Mimno, Tuesday/Thursday 1:25-2:40, Spring 2019, Hollister B14

Syllabus

For questions use Campuswire, a new Piazza alternative. Use the code 4212 to join the class. All assignments and submissions will happen through CMS.

Goal: Students will be able to select and apply common statistical methods to real-world settings, and assess the validity of results. Topics will include combinatorics, data-intensive programming, probability, regression, matrix factorization, Markov chains, and graph algorithms.

Schedule

Probability. Being certain about uncertainty. Estimating probabilities by counting and from data. Turning probabilities into other probabilities. Bayes' Theorem. Simulating random processes. File and string manipulation. Regular expressions. Word counting. Naive Bayes classifiers.

  • January 22: Course introduction, finding lead in Flint, what is data science in general and InfoSci data science in particular? Slides. Data collected in class.
  • January 24: Probability. Joint and conditional. Independent variables. Notes.
  • January 29: Probability. Joint and conditional. Bayes rule. Normalization. Notes.
  • January 31: Expectation. Expectations of functions. Mean and variance. Notes and Notebook on how to choose iteration structures.
  • February 5: A real example of data collection and analysis. Variable types and conversion. Bar plots. First peek at Poisson distributions. Data and Code.
  • February 7: Regular expressions. Text as Data. Smoothed probability estimates. Notes.
  • February 12: Distributions, expectations and variance. Binomial distribution. Notes.
  • February 14: More distributions, geometric, Poisson, normal. Code and Notes.
  • February 19: Searching for distributions in football player data. Code. Data.
  • February 28: Basketball, geometric distribution for free throws, generating a Poisson random variable by counting random numbers in intervals, searching for Poissons in real data. Code.
  • March 5: XML data formats, escaping HTML with entities, capturing groups in RegEx; still searching for Poissons in real data. Code.
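
The February 28 idea of generating a Poisson random variable by counting random numbers in intervals can be sketched roughly as follows. This is an illustration under stated assumptions, not the code linked from the schedule: the function names, the number of points, and the number of intervals are all made up for the example. Uniform random numbers are scattered over a long range, the range is cut into unit-length intervals, and the count in each interval should look approximately Poisson with rate equal to points per interval.

```python
import math
import random
from collections import Counter


def interval_counts(num_points, num_intervals, seed=0):
    """Scatter uniform random numbers over [0, num_intervals) and count
    how many land in each unit-length interval. (Hypothetical helper.)"""
    rng = random.Random(seed)
    counts = [0] * num_intervals
    for _ in range(num_points):
        x = rng.random() * num_intervals
        # Guard against floating-point rounding at the upper edge.
        counts[min(int(x), num_intervals - 1)] += 1
    return counts


def mean_and_variance(values):
    """Population mean and variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


def poisson_pmf(k, rate):
    """Probability of seeing exactly k events under a Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Assumed example sizes: 30,000 points over 10,000 intervals, so rate = 3.
counts = interval_counts(num_points=30000, num_intervals=10000)
rate = 30000 / 10000
mean, variance = mean_and_variance(counts)
print(f"mean {mean:.3f}, variance {variance:.3f}")

# Compare the observed share of intervals with k points to the Poisson pmf.
histogram = Counter(counts)
for k in range(8):
    observed = histogram[k] / len(counts)
    print(f"{k}: observed {observed:.4f}, Poisson {poisson_pmf(k, rate):.4f}")
```

A Poisson distribution has equal mean and variance, so comparing those two numbers is a quick first check when searching for Poissons in real data.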

Statistical modeling. Combining information. Making predictions. Measuring confidence. Gaussian distributions. Lots of other distributions. Variable transformations. Single-variable regression. Hypothesis testing. Multi-variable regression. Poisson and logistic regression.

  • March 12: JSON data format, correlation and covariance, linear models. Slides. Code.
  • March 14: Linear models, estimates of slope and intercept, guessing π from noisy measurements, predicting movie revenue. Slides. Code.
  • March 19: What makes linear models good? Loss functions. R squared. Straightening curves. Interactive example. Code.
  • March 21: Linear models on Citi Bike data, multiple inputs, indicator variables, pandas data frames; also: you are the subject of other people's data science 1000 times every single day and you will never know who or why or what the implications are for you. Code.
  • March 26: Linear models on jobs data. What does "seasonally adjusted" mean? Should we freak out that February had 20,000 new jobs? Code.
  • March 28: p-values from permutation tests. You will always get an estimated slope from a linear regression, regardless of whether there is actually any relationship between variables. The smaller the dataset, the more likely it is that you will get a slope far from zero just by bad luck. Shuffling the pairing of x and y values over and over and counting how often you get a more extreme slope than the one you actually got is a good way to decide if your slope is convincing. The t test approximates this. Code.
  • April 9: Be the algorithm! Fitting the weights for multiple inputs simultaneously when they all affect each other is hard. It helps to normalize the inputs so that they have zero mean and unit variance. We can also use calculus to get "hints" about whether to increase or decrease each weight. If we repeatedly move each weight by a small fraction of the gradient, in the direction that reduces the error, we get gradient descent; computing the gradient from one randomly chosen example at a time gives stochastic gradient descent. Interactive example.
  • April 11: Polynomial regression and fitting curves with linear models. Overfitting. Poisson regression and multiplicative effects. Logistic regression. Your tastes are set early. Code.
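
The permutation test described for March 28 can be sketched as below. This is a minimal illustration, not the code linked from the schedule: the function names, the number of shuffles, and the choice to count shuffled slopes that are at least as far from zero as the observed one (a two-sided comparison) are assumptions made for the example.

```python
import random


def slope(xs, ys):
    """Least-squares slope of y on x for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, num_shuffles=10000, seed=0):
    """Share of shuffled pairings whose slope is at least as far from
    zero as the slope of the real pairing. (Hypothetical helper.)"""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)  # copy so the caller's data is left untouched
    extreme = 0
    for _ in range(num_shuffles):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(xs, shuffled)) >= observed:
            extreme += 1
    return extreme / num_shuffles


# Made-up data: a real upward trend plus noise, on a small dataset.
noise = random.Random(1)
xs = list(range(12))
ys = [0.5 * x + noise.gauss(0, 2) for x in xs]
print("slope:", slope(xs, ys))
print("permutation p-value:", permutation_p_value(xs, ys))
```

A small p-value means that shuffled pairings rarely produce a slope as extreme as the real one, which is the sense in which the slope is "convincing".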
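
The April 9 recipe (normalize the inputs, then nudge every weight by a small fraction of the gradient over and over) might look roughly like this. It is a sketch under assumptions: the function names, learning rate, step count, and toy data are invented for the example, and it computes the gradient from all rows at once (full-batch gradient descent) rather than from random examples as the stochastic variant would.

```python
def standardize(column):
    """Rescale one input column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def fit_linear(rows, targets, learning_rate=0.1, num_steps=500):
    """Fit weights and a bias by gradient descent on mean squared error.
    (Hypothetical helper; learning_rate and num_steps are assumed values.)"""
    n = len(rows)
    k = len(rows[0])
    weights = [0.0] * k
    bias = 0.0
    for _ in range(num_steps):
        grad_w = [0.0] * k
        grad_b = 0.0
        for row, target in zip(rows, targets):
            prediction = bias + sum(w * x for w, x in zip(weights, row))
            error = prediction - target
            # Derivative of the mean squared error for this row.
            grad_b += 2 * error / n
            for j in range(k):
                grad_w[j] += 2 * error * row[j] / n
        # Step against the gradient to reduce the error.
        bias -= learning_rate * grad_b
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return weights, bias


# Toy data: two inputs on different scales that jointly determine y.
z1 = standardize([1, 2, 3, 4, 5, 6])
z2 = standardize([50, 30, 60, 10, 40, 20])
rows = list(zip(z1, z2))
targets = [5 + 3 * a - 2 * b for a, b in rows]
weights, bias = fit_linear(rows, targets)
print("weights:", weights, "bias:", bias)
```

Standardizing first puts all inputs on a comparable scale, so a single learning rate works reasonably for every weight.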

Clustering and representation learning. Finding patterns. Distance functions. Iterative algorithms. k-Means clustering. Vectors and inner/outer products. Matrix factorization. SVD, NMF. Eigenvectors and random walk algorithms.

Course policies

Grading will be 50% problem sets, 25% final exam/project, 20% midterm exam, 5% professionalism (attendance, participation, etc.).

Students with disabilities: We will make every effort possible to ensure that the class works for all students. Contact Prof. Mimno if there is anything we should know about. If there is a specific event such as an exam that you are concerned about, please inform us at least two weeks in advance so that we have time to make arrangements.

SONA credits: Many researchers on campus need participants for user studies and other types of experiments. The SONA system allows you to register for such studies. You will get 0.5 percent extra credit in this course for each 30-minute study (or equivalent), up to a maximum of 2.0 percent. Participating in studies is a great way to find out what real research looks like. To register, go to the SONA system.

Attendance: You have to come to class. All Friday sessions will take attendance; Tuesday/Thursday lectures will take attendance at random intervals.

Slip days: Each student can take up to five "slip days" for homework with no penalty. These are designed to accommodate ordinary stresses and mishaps: the three prelims that you have at the same time, the interview in California (good luck!), the cold that knocks you out for a day. If you turn in the wrong file or an unreadable file we will notify you within 12 hours; resubmitting the correct file will cost one slip day. For out-of-the-ordinary events (serious injuries or health problems, family emergencies, etc.), write to Prof. Mimno or a TA. It's ok to not be ok! But we can only help you if you tell us that something isn't right when we still have time to do something about it.

Integrity: We will follow university policies as outlined in the Academic Integrity Handbook. You should not take credit for work you did not do. You should not tolerate any other student doing so. You may discuss homework problems, but you have to write your own answers by yourself. You may consult online forums or look at examples, but you cannot copy text or code from them. You are not helping your friend by allowing them to not learn. You do not deserve a good grade because you are smart: there is no such thing as an "A" student, only "A" work. Remember that the focus of this class is identifying statistically unlikely patterns in data.