AIPI 510: Sourcing Data for Analytics (Fall)

Course Description
In industry, one of the main activities and challenges in implementing machine learning applications is collecting data to use in modeling. This course introduces students to both the technical and non-technical (business, regulatory, ethical) aspects of collecting, cleaning, and preparing data for use in machine learning applications. The first segment of the course is an introduction to numerical programming, focused on building skills in working with data via the NumPy and pandas libraries, two of the most common tools used by teams working with data and modeling. Technical aspects covered include the types of data and methods of sourcing data via the web, APIs, and domain-specific sensors and hardware (IoT devices), an increasingly common source of analytics data in technical industries. The course also introduces methods and tools for evaluating the quality of data, performing basic exploratory data analysis, and pre-processing data for use in analytics. Non-technical aspects covered include an introduction to data privacy, GDPR, regulatory issues, bias, and industry-specific concerns regarding data usage. The course concludes with a real-world project in which students work on a problem of their choice, sourcing and analyzing multiple data sets to extract useful insights.

Pre-Requisites
Students are expected to understand the main concepts of calculus, linear algebra, and probability and statistics, and to have a foundational level of proficiency in Python programming.

Learning Objectives
Through this course, students will be expected to:
• Understand the different types of data and their applications in modeling
• Understand the various sources for data (sensors/hardware, APIs/web, etc.) and be knowledgeable in methods to collect data from each source
• Demonstrate skills in working with data in Python, via the NumPy and pandas libraries
• Develop experience in collecting and pre-processing data for use in analytics models, via hands-on programming
• Be able to evaluate the usefulness of datasets for analytics purposes, including measures of quality as well as quantity
• Demonstrate skills in analyzing data via exploratory data analysis
• Build an appreciation for important regulatory and ethical considerations when sourcing data for use in AI
• Gain experience in the end-to-end process of sourcing data for use in AI modeling – from identifying data needs, through determining potential sources, assessing legal and ethical concerns, evaluating potential sources, and cleaning and pre-processing data, to performing exploratory data analysis

Course Materials
• “Python Data Science Handbook: Essential Tools for Working with Data”, by Jake VanderPlas, O’Reilly Media; 1st edition (December 10, 2016), ISBN-13: 978-1491912058. Full text and code are freely available at https://jakevdp.github.io/PythonDataScienceHandbook/.

Required Free Software:
• Python 3.7.x (installing Python via the Anaconda distribution is suggested: https://www.anaconda.com/distribution/)
• The following libraries must also be installed (can be installed using pip or conda):
o NumPy
o pandas
o Jupyter Notebook
o Matplotlib
• This class will utilize GitHub for the distribution and collection of coding assignments. Please see the “Assignment Instructions” document for instructions on setting up Git and GitHub if you do not already have them, and preparing your laptop environment to work on the assignments.
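
As a quick check that the required libraries listed above are available, a minimal sketch along the following lines can be run in the course environment. It is not part of the official assignment instructions: the helper name missing_packages is hypothetical, and the import name "notebook" for Jupyter Notebook is an assumption about how that package is installed.

```python
import importlib.util

# Import names for the libraries listed above. "notebook" is assumed to be
# the import name of the Jupyter Notebook package.
REQUIRED = ["numpy", "pandas", "notebook", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        # Anything listed here still needs to be installed with pip or conda.
        print("Missing:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

Any package reported as missing can then be installed with pip or conda, as noted above.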

Course Grading
• 30% Homework assignments (10 assignments)
• 30% Project (5% proposal, 25% final presentation)
• 30% Final exam
• 10% Class participation / quizzes

Jon Reifschneider
Director of Masters Studies, AI for Product Innovation