QSS 20/PBPL 40.01 Modern Statistical Computing

Social scientists are investigating questions that have led to two changes in their computing workflow.

One change is the use of new forms of data: text data to study how police officers use different language when interacting with Black drivers than with White drivers; spatial data to study the geographic clustering of autism diagnoses in more affluent communities; cellphone mobility data to (try to) estimate COVID-19 mobility patterns.

The second change is the use of new methods to discern patterns in data. Imagine a relatively simple dataset where each individual is described by a limited number of characteristics: for instance, a student and his or her demographic attributes and high school end-of-year grades. Now imagine augmenting that dataset with the forms of data described above: we know the student’s address and can thus merge in spatial data on neighborhood characteristics; we have qualitative notes from the teacher’s end-of-year reports and can investigate how those qualitative impressions correlate with grades. Working with these augmented datasets requires you as the researcher to quickly pick up new methods for finding patterns in large-scale data, even as the methods and tools develop at a rapid pace.
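As a rough illustration of the merge described above, here is a minimal Python sketch that attaches neighborhood characteristics to student records. Every name and value in it is invented for illustration (the `tract` key, the `median_income` field, and the `merge_students` helper are assumptions, not part of any course dataset), and real projects would more likely use a library such as pandas.

```python
# Toy records; every field name and value here is made up for illustration.
students = [
    {"student_id": 1, "tract": "A", "grade": 88},
    {"student_id": 2, "tract": "B", "grade": 75},
    {"student_id": 3, "tract": "C", "grade": 91},  # tract "C" has no match below
]

# Hypothetical neighborhood characteristics, keyed by the same tract code.
neighborhoods = {
    "A": {"median_income": 54000},
    "B": {"median_income": 71000},
}


def merge_students(students, neighborhoods):
    """Left-join neighborhood attributes onto each student by tract.

    Students whose tract is missing from `neighborhoods` are kept,
    with median_income set to None.
    """
    merged = []
    for student in students:
        extra = neighborhoods.get(student["tract"], {"median_income": None})
        merged.append({**student, **extra})
    return merged


merged = merge_students(students, neighborhoods)
for row in merged:
    print(row)
```

Keeping unmatched students (rather than dropping them) is a design choice: it makes missing neighborhood data visible instead of silently shrinking the sample.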

This course is meant to build upon your introductory programming course and to equip you with the computing literacy to conduct social science research in the age of “big data.” It has two core components. The first is learning the background tools (e.g., Git/GitHub, LaTeX, and working on the command line) needed to conduct transparent and reproducible research. The second is learning programming skills essential for social science in the big data era, with a focus on using Python for various applied tasks, R for tasks like data visualization, and SQL for tasks like working with the relational databases that form the backbone of many real-world government and commercial datasets.

Prerequisites

  • Required: COSC 1, ENGS 20, or another programming course approved by the QSS Chair.

  • Recommended: introductory statistics course.

Textbook v. DataCamp

  • Is there a textbook? There is no textbook. For specific programming questions, DataCamp videos give good introductions to the concepts, and Stack Overflow and other online resources help fill in the gaps.

  • What is a DataCamp module and how will I keep track of them? A module in DataCamp consists of two components. First is a short video introducing a concept like writing a loop (which you can click through if you already know the concept). Next is a series of tasks that ask you to write code to do something; you submit your code and progress to the next task once it successfully does what it’s supposed to. To keep track of the modules, we have a course page hosted within DataCamp that you will sign up for with your Dartmouth email.

All assigned modules will be posted on that page, so the course schedule gives only each module’s general name. The modules should be completed before the corresponding class, so that you can more easily work on the in-class activity.