
An Introduction to Big Data and Machine Learning for Survey Researchers and Social Scientists

Course Date: July 20-22

Days: W-F (1:00pm - 5:00pm)

The amount of data generated as a by-product in society is growing fast and includes data from satellites, sensors, transactions, social media and smartphones, to name just a few sources. Such data are often referred to as “big data” and can be used to create value in areas such as health, crime prevention, commerce and fraud detection. An emerging practice in many areas is to append or link big data sources to more specific, smaller-scale sources that often contain much more limited information. Survey researchers have used this practice for some time when constructing frames, appending auxiliary information that is often not directly available on the frame but can be obtained from an external source. For survey researchers, big data has the potential to reach beyond the sampling phase, and in fact to influence the social sciences in general. Public opinion researchers and agencies that produce statistics are interested in big data as an alternative data source that could reduce costs, improve estimates or produce estimates in a more timely fashion. However, big data poses several new and interesting challenges to survey researchers, social scientists and others who want to extract information from data. As Robert Groves (2012) pointedly commented, the era is “appropriately called Big Data and not Big Information”, because there is a lot of work for analysts before information can be gained from “auxiliary traces of some process that is going on in society.”

This course offers participants a broad overview of big data sources, opportunities and examples motivated by survey and social science contexts, including the use of social media data, paradata and other such sources. It also offers a detailed, practical introduction to four common machine learning methods that can be applied to big and small data alike at various stages of a study’s lifecycle, from design through nonresponse adjustment, propensity score matching and weighting to evaluation and analysis. The machine learning methods will be demonstrated in R, with several examples that draw on multiple R packages offering these methods.


.5 course hour
Instructor: Trent Buskirk
Prerequisite: Basic proficiency in R (i.e., knowing how to load and launch a package, plus basic R syntax)