Course Description
Every day, your organization generates new data on your customers, your processes, and your industry. But could you be using this data more effectively? Discover how to turn big data into even bigger results in this seven-week online course and earn an MIT Certificate on Data Science as well as 1.8 Continuing Education Units (CEUs) upon completion.
What You'll Learn
- Apply data science techniques to your organization’s data management challenges.
- Identify and avoid common pitfalls in big data analytics.
- Deploy machine learning algorithms to mine your data.
- Interpret analytical models to make better business decisions.
- Convert datasets to models through predictive analytics.
- Understand the challenges associated with scaling big data algorithms.
- Appreciate the importance of choosing how to represent your data when making predictions.
Want to purchase this course for a group?
You can purchase enrollment codes for this course to distribute to your team.
Instructors
Devavrat Shah, Course Co-Director; Director, Statistics and Data Science Center (SDSC); Professor, Electrical Engineering and Computer Science; Member, Laboratory for Information and Decision Systems (LIDS), Computer Science and Artificial Intelligence Laboratory (CSAIL), and Operations Research Center (ORC) at MIT
Dr. Shah received his Bachelor of Technology in Computer Science and Engineering from the Indian Institute of Technology, Bombay, in 1999. He received the President of India Gold Medal, awarded to the best graduating student across all engineering disciplines. He received his Ph.D. in Computer Science from Stanford University. His doctoral thesis won the George B. Dantzig award from INFORMS for best dissertation in 2005. After spending a year between Stanford, Berkeley, and MSRI, he started teaching at MIT in 2005. In 2013, he co-founded Celect, Inc. to commercialize his research at MIT.
Philippe Rigollet, Course Co-Director; Associate Professor, Mathematics Department and Statistics and Data Science Center (SDSC) at MIT
At the University of Paris VI, Dr. Rigollet earned a B.S. in statistics in 2001, a B.S. in applied mathematics in 2002, and a Ph.D. in mathematical statistics in 2006. He has held positions as a visiting assistant professor at the Georgia Institute of Technology, and as an assistant professor at Princeton University.
Guy Bresler, Assistant Professor, Electrical Engineering and Computer Science, LIDS and IDSS at MIT
He received his Ph.D. from the Department of Electrical Engineering and Computer Science at UC Berkeley, and was a postdoc at MIT.
Tamara Broderick, Assistant Professor, Institute for Data, Systems, and Society (IDSS), Electrical Engineering and Computer Science Department (EECS) at MIT
Prior to joining MIT, she earned her Ph.D. in Statistics at UC Berkeley, an AB in Mathematics from Princeton University, a Master of Advanced Study for completion of Part III of the Mathematical Tripos from the University of Cambridge, an MPhil by research in Physics from the University of Cambridge, and an MS in Computer Science from UC Berkeley. Dr. Broderick was awarded the Evelyn Fix Memorial Medal and Citation, the Berkeley Fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize.
Victor Chernozhukov, Professor, Department of Economics; Statistics and Data Science Center (SDSC) at MIT
David Gamarnik, Professor, Sloan School of Management, IDSS, and the Operations Research Center at MIT
Stefanie Jegelka, Assistant Professor, Institute for Data, Systems, and Society (IDSS), Electrical Engineering and Computer Science Department (EECS) at MIT
Prior to joining MIT, she was a postdoc in the AMPlab and computer vision group at UC Berkeley, and a Ph.D. student at the Max Planck Institutes in Tuebingen and at ETH Zurich.
Jonathan Kelner, Professor, Department of Mathematics, and member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Dr. Kelner received a B.A. in mathematics from Harvard in 2002 and the David Mumford Award as the top Harvard graduate in mathematics. He received his M.S. and Ph.D. degrees from MIT in Electrical Engineering and Computer Science in 2005 and 2006. Dr. Kelner was a Member of IAS 2006-2007 before joining the MIT faculty in applied mathematics as an assistant professor in 2007. He was named associate professor in 2012. He is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
Ankur Moitra, Associate Professor, Department of Mathematics, and member of the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT
Dr. Moitra received his B.S. in electrical and computer engineering from Cornell in 2007. He completed his M.S. in 2009 and his Ph.D. in 2011 in computer science at MIT. Notably, he received a George M. Sprowls Award and a William A. Martin Award for best thesis for his doctoral and master’s dissertations. He then spent two years as an NSF CI Fellow at the Institute for Advanced Study while he was a senior postdoc in the computer science department at Princeton University.
Caroline Uhler, Associate Professor, Institute for Data, Systems, and Society (IDSS), Electrical Engineering and Computer Science Department (EECS) at MIT
Kalyan Veeramachaneni, Principal Research Scientist, MIT Laboratory for Information and Decision Systems (LIDS)
Dr. Veeramachaneni has co-founded two startups: Feature Labs and PatternEx. Feature Labs helps organizations transform their raw, noisy data into intelligent representations using data science automation tools. PatternEx, a cybersecurity startup, is focused on developing the first active-learning-based solution for identifying new security threats and constantly evolving models that detect threats. His work on AI-driven solutions for data science and cybersecurity has been covered by major media outlets, including the Washington Post, CBS News, Wired, Forbes, and Newsweek. Dr. Veeramachaneni received his master's in Computer Engineering and his Ph.D. in Electrical Engineering in 2009, both from Syracuse University. After his Ph.D., he joined MIT in 2009.
WHY MIT XPRO?
It’s professional development, the MIT way.
MIT xPRO courses provide professional development opportunities to individuals, teams, and companies across the world. Leveraging the latest learning technologies, MIT xPRO courses and programs are designed to provide a high-quality education experience while accommodating your busy life.
MIT xPRO learners are not only scientists, engineers, technicians, managers, and consultants; they are change agents. They take the initiative, push boundaries, and define the future.
EARN A CERTIFICATE OF COMPLETION AND CEUS
Participants who successfully complete the course and all assessments will receive a Certificate in Data Science from MIT xPRO and 1.8 Continuing Education Units (CEUs). This course does not carry MIT credits or grades; however, a 60% pass rate is required to receive the certificate. Course requirements include:
- Submission and peer-review of two case studies
- Passing grades on eight assessments
WHO SHOULD PARTICIPATE
This course is designed for data scientists and data analysts, as well as professionals who wish to turn large volumes of data into actionable insights. Because of the broad nature of the information, the course is well suited for both early-career professionals and senior managers. Since this is not an introductory course, the faculty strongly recommends that participants have substantial background knowledge of statistical techniques and data calculations or quantitative methods of data research.
Participants may include:
- Technical managers
- Business intelligence analysts
- Management consultants
- IT practitioners
- Business managers
- Data science managers
- Data science enthusiasts
COURSE REVIEWS
“Leveraging this knowledge will allow me to position myself as a hybrid analyst-data scientist, which greatly increases my value to the company.” - Ryan Michael Dickinson
“I really enjoyed the interactions/animations in the videos. These really helped with visualizing the concepts… I feel more equipped to understand what type of insights can be gleaned from a particular set of data, and can better communicate these asks to our data science team.” - Reza Dawood
“The course content was really amazing and gave me exact direction to head towards the Big Data topic.” - Prasad Sankpal
“It's very critical to keep acquiring new knowledge in today's ever changing landscape of both world order and opportunities available to professionals.” - Joanna Zarach
“The quality and pace of the videos and material is top-notch. I really like having different instructors for different modules and having two instructors interacting together makes the material more vivid and entertaining.” - Miguel Hurtado
“Armed with the knowledge I have gained from his course, I can introduce my team to certain methods that can be applied to our day to day work.” – Anonymous Learner
COURSE OUTLINE
Course materials blend the following pedagogical strategies to best achieve the learning objectives of the course and individual modules:
- Instructivism: Teacher-centered learning where the instructors present relevant content (tutorial videos enhanced with animation and graphics). Students will test their knowledge through graded tests.
- Constructivism: A learning-by-doing approach. We encourage learners to construct their own understanding by solving the mandatory and optional case studies and practicing.
- Social constructivism: Learning through social interactions and communication. You will be able to hold discussions with your peers in the discussion groups, and to give and receive peer reviews through two compulsory case studies.
- Connectivism: Connecting with others and extending your knowledge through communication. You will be able to expand and share your knowledge with others through the discussion group and the course groups on Facebook and LinkedIn.
Module 1: Making sense of unstructured data
- Clustering
- Spectral Clustering, Components and Embeddings
- Case Studies
Module 2: Regression and Prediction
- Classical linear and nonlinear regression and extensions
- Modern Regression with High-Dimensional Data
- The use of modern regression for causal inference
- Case Studies
Module 3: Classification, Hypothesis Testing and Anomaly Detection
- Hypothesis Testing and Classification
- Deep Learning
- Case Studies
Module 4: Recommendation Systems
- Recommendations and ranking
- Collaborative filtering
- Personalized recommendations
- Case Studies
- Wrap-up: Parting remarks and challenges
Module 5: Networks and Graphical Models
- Introduction
- Networks
- Graphical Models
- Case Studies
Module 6: Predictive Modeling for Temporal Data
- Introduction
- Prediction engineering
- Feature engineering
- Modeling and evaluating predictive models
CASE STUDY OUTLINE
In this course, you won’t just discover new strategies, tools, and insights; you’ll put them to the test. Every course module features a selection of case studies and hands-on projects that help you apply your newfound knowledge to realistic business challenges.
Module 1: Making sense of unstructured data
Case Study 1: Genetic Codes
- Case Study Activity Description: Use K-means to figure out that DNA is composed of three-letter words. We’ll help by demonstrating how to apply data visualization to genomic sequence analysis.
- Data Sets & format: DNA text string
- Tools used: Matlab
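The feature-extraction step behind this case study can be sketched as follows. This is a minimal illustration in Python, not the course's Matlab material: the function names, the toy sequence, and the fragment length are all assumptions made for the example. Each fragment of the DNA string is turned into a vector of counts of non-overlapping m-letter words; vectors like these could then be passed to any K-means implementation and compared across m = 1, 2, 3, 4.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def word_frequencies(fragment, m):
    """Count non-overlapping m-letter words in a DNA fragment.

    Returns a list of counts ordered over all 4**m possible words.
    """
    words = [fragment[i:i + m] for i in range(0, len(fragment) - m + 1, m)]
    counts = Counter(words)
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=m)]
    return [counts[w] for w in vocabulary]


def fragment_features(sequence, fragment_length, m):
    """Split a sequence into fixed-length fragments and featurize each one."""
    fragments = [
        sequence[i:i + fragment_length]
        for i in range(0, len(sequence) - fragment_length + 1, fragment_length)
    ]
    return [word_frequencies(f, m) for f in fragments]


# Toy sequence (hypothetical, not the course data set).
toy = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
features = fragment_features(toy, fragment_length=12, m=3)
```

Each row of `features` has 4**3 = 64 entries, one per possible three-letter word. A real analysis would use much longer fragments so that the count vectors are not dominated by zeros.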
Case Study 2: LDA Analysis
- Case Study Activity Description: Find themes in project descriptions using LDA. We’ll help by giving you tips on how to do your own analysis on MIT EECS faculty data using stochastic variational inference on LDA.
- Data Sets & format: Scrape your own
- Tools used: Python
Case Study 3: PCA: Identifying Faces
- Case Study Activity Description: Implement your own image classification algorithm that helps classify photos of people’s faces. We’ll help by giving you tips on how to use PCA, along with examples and pseudo-code for the programming environment.
- Data Sets & format: Instructors' photos provided (14). Any other images will work, as long as they obey the restrictions noted in the Self Help document.
- Tools used: Matlab
Case Study 4: Spectral Clustering: Grouping News Stories
- Case Study Activity Description: Build your own clustering for online news stories, similar to how Google News organizes stories via auto-generated topics. We’ll help by giving you tips on Spectral Clustering, along with examples and pseudo-code for the programming environment.
- Data Sets & format: Instructions for downloading news stories off the web.
- Tools used: Python
Module 2: Regression and Prediction
Case Study 1: Predicting Wages 1
- Case Study Activity Description: Predict wages and assess predictive performance using various characteristics of workers. We’ll help by describing the wage prediction model.
- Data Sets & format: CPS 2012 Data, Rdata format
- Tools used: R
Case Study 2: Gender Wage Gap
- Case Study Activity Description: Estimate the difference in predicted wages between men and women with the same job characteristics. We’ll help by describing the estimation technique and presenting the results.
- Data Sets & format: CPS 2012 Data, Rdata format
- Tools used: R
Case Study 3: Do Poor Countries Grow Faster than Rich Countries?
- Case Study Activity Description: Use a large dimensional dataset to answer the question: Do poor countries grow faster than rich countries? We’ll help by describing the estimation technique, giving you the tools, and presenting the results.
- Data Sets & format: Barro-Lee Growth Data. Rdata format.
- Tools used: R
Case Study 4: Predicting Wages 2
- Case Study Activity Description: Predict wages using several machine learning methods and splitting data. We’ll help by describing the estimation technique and presenting the results.
- Data Sets & format: 2015 CPS data, Rdata format.
- Tools used: R
Case Study 5: The Effect of Gun Ownership on Homicide Rates
- Case Study Activity Description: Use machine learning methods to estimate the effect of gun ownership on the homicide rate. We’ll help by describing the estimation technique and presenting the results.
- Data Sets & format: U.S. Census Bureau Dataset. Csv format.
- Tools used: R
Module 3.1: Classification and Hypothesis Testing
Case Study 1: Logistic Regression: The Challenger Disaster
- Case Study Activity Description: Learn how to apply Logistic Regression in a practical real-world setting. We’ll help by giving you tips, examples, and pseudo-code for the programming environments.
- Data Sets & format: Made available as a csv file along with the case study.
- Tools used: User choice: Python (using the statsmodels library) or R (using the built-in glm function).
Module 3.2: Deep Learning
Case Study 2: Decision boundary of a deep neural network
- Case Study Activity Description: Play with one- or two-layer perceptrons to assess their decision boundaries. We’ll help by explaining the multiple dimensions of perceptrons.
- Data Sets & format: Synthetic 2D data points.
- Tools used: Python (coding is not required for students)
Module 4: Recommendation Systems
Case Study 1: Recommending Movies
- Case Study Activity Description: Build your own recommendation system for movies like the one used by Netflix. We’ll help by giving you tips, examples, and pseudo-code for the programming environments.
- Data Sets & format: MovieLens dataset - public set
- Tools used: User choice: Python or R. For recommenders: RecommenderLab and Graphlab-Create
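As a rough illustration of the collaborative filtering idea behind this case study, the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. It is a minimal, assumption-laden example: the ratings dictionary is invented toy data (not MovieLens), the function names are hypothetical, and it does not use the RecommenderLab or Graphlab-Create APIs.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}. Not the MovieLens data.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity between two users over the movies both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        numerator += sim * other_ratings[movie]
        denominator += sim
    return numerator / denominator if denominator else None


prediction = predict("ann", "D")
```

Because ann's ratings resemble bob's more than cat's, the prediction for movie D is pulled toward bob's rating. Practical systems typically also center ratings per user and restrict the average to the nearest neighbors.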
Case Study 2: Recommend New Songs to Users Based on Their Listening Habits
- Case Study Activity Description: Build your own recommendation system for songs like the one used by Spotify. We’ll help by giving you tips, examples, and pseudo-code for the programming environments.
- Data Sets & format: Million Song dataset
- Tools used: User choice: Python or R. For recommenders: RecommenderLab and Graphlab-Create
Case Study 3: Make New Product Recommendations
- Case Study Activity Description: Build your own recommendation system for products on an e-commerce website like the one used by Amazon.com. We’ll help by giving you tips, examples, and pseudo-code for the programming environments.
- Data Sets & format: Amazon Reviews data
- Tools used: User choice: Python or R. For recommenders: RecommenderLab and Graphlab-Create
Module 5: Networks and Graphical Models
Case Study 1: Navigation / GPS
1.1: Kalman Filtering: Tracking the 2D Position of an Object when moving with Constant Velocity
- Case Study Activity Description: Generate data, build the model for the motion dynamics, and perform the Kalman Filtering algorithm. We’ll help by giving you tips, examples, and pseudo-code for the programming environment.
- Data Sets & format: Generating your own data. Model explanation and other parameter details provided in a separate write-up.
- Tools used: Python. Using libraries like numpy, matplotlib
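The generate-model-filter workflow described above might look roughly like the following sketch. It is an illustration under stated assumptions rather than the course's write-up: the time step, noise levels, initial covariance, and the true velocity used to generate data are all made-up values, and the function name `kalman_filter` is hypothetical. The state is `[x, y, vx, vy]` and only the position is observed.

```python
import numpy as np


def kalman_filter(measurements, dt=1.0, meas_std=1.0, process_std=0.01):
    """Track 2D position and velocity under a constant-velocity model."""
    # State transition: position advances by velocity * dt, velocity is constant.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    # Observation model: only the (x, y) position is measured.
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    Q = process_std ** 2 * np.eye(4)   # process noise covariance (assumed)
    R = meas_std ** 2 * np.eye(2)      # measurement noise covariance (assumed)

    x = np.zeros(4)                    # initial state guess
    P = 100.0 * np.eye(4)              # large initial uncertainty (assumed)
    estimates = []
    for z in measurements:
        # Predict step.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        estimates.append(x.copy())
    return np.array(estimates)


# Generate synthetic data: an object moving with constant velocity,
# observed through noisy position measurements.
rng = np.random.default_rng(0)
steps = 50
true_velocity = np.array([1.0, 0.5])
true_positions = np.outer(np.arange(1, steps + 1), true_velocity)
measurements = true_positions + rng.normal(scale=1.0, size=true_positions.shape)

estimates = kalman_filter(measurements)
```

The last row of `estimates` holds the filtered position and velocity after all 50 measurements. Plotting `measurements` against `estimates[:, :2]` with matplotlib is a natural next step for seeing how the filter smooths the noisy track.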
1.2: Kalman Filtering: Tracking the 3D Position of an Object falling due to gravity.
- Case Study Activity Description: Generate data, build the model for the motion dynamics, and perform the Kalman Filtering algorithm. We’ll help by giving you tips, examples, and pseudo-code for the programming environment.
- Data Sets & format: Generating your own data. Model explanation and other parameter details provided in a separate write-up.
- Tools used: Python. Using libraries like numpy, matplotlib
Case Study 2: Identifying New Genes that cause Autism
- Case Study Activity Description: Use network-theoretic ideas to identify new candidate genes that might cause autism. We’ll help by giving you tips, examples, and pseudo-code for the programming environment.
- Data Sets & format: Made available as csv files.
- Tools used: R
Module 6: Predictive Modeling for Temporal Data
Case Study 1: New York City
- Case Study Description: To predict the trip duration of a New York taxi cab ride, build different types of features and evaluate them. We will start by describing what a feature is in this context, then develop some very simple features and add features using the software package featuretools. We will assess how these features perform in predicting trip duration.
- Datasets and format: Multiple csv files, loaded as pandas data frames.
- Tools used: Featuretools (Deep feature synthesis)
Case Study 2: Prediction Engineering Using UK Retail Dataset
- Case Study Description: Given a retail dataset, we will formulate a prediction problem as a retailer would, and develop an end-to-end solution using featuretools for feature engineering and scikit-learn for modeling. We will change the prediction problem, tune its parameters, and see how the model performance changes.
- Datasets and format: Multiple csv files, loaded as pandas data frames.
- Tools used: Featuretools, scikit-learn.
FREQUENTLY ASKED QUESTIONS
What is the time commitment for this course?
MIT xPRO courses are designed to fit the schedules of busy professionals. The course requires a time commitment of 4.5 hours a week, consisting of videos, assigned reading, and assignments.
For participants who wish to engage with the optional case study activities, please allow an extra 3+ hours a week. These optional case study tutorials will require some prior knowledge of and experience with the programming language you choose to use for reproducing case study results. Generally, participants with 6 months of experience using R or Python should be successful in going through these exercises. Please note that the optional case study activities are not required and do not count towards your "grade" or earning a certificate of completion. However, there are two compulsory case studies (no coding skills required since the code and instructions are provided).
Each video module is pre-recorded enabling you to watch it anytime. While you may complete most of the program as quickly as you wish, most participants find it beneficial to adhere to the weekly schedule and participate in online discussion forums along the way.
Deadlines for Graded Activities:
*Note: there is one droppable Case Study and one droppable Graded Assessment (the lowest scores are dropped).
- March 10, 2019: Submit the Recommendation System Case Study (Module 4)
- March 11, 2019: Peer-Review the Recommendation System Case Study (Module 4)
- March 24, 2019: Submit Case Study (Module 6)
- March 25, 2019: Peer-Review Case Study (Module 6)
- March 25, 2019: Submit all end-of-topic graded assessments
What are the browser or other technical requirements?
Accessing our courses requires an Internet connection, as videos are only available via online streaming and cannot be downloaded for offline viewing. Please take note of your company's restrictions for viewing content and/or firewall settings.
Our courseware works best with current versions of Google Chrome, Firefox, or Safari, or with Internet Explorer version 10 and above. For the best possible experience, we recommend switching to an up-to-date version of Chrome. If you do not have Chrome installed, you can get it for free here: www.google.com/chrome/browser
We are unable to fully support access with mobile devices at this time. While many components of your courses will function on a mobile device, some may not.
Who can register for this course?
U.S. sanctions do not permit us to offer this course to learners in or ordinarily residing in Crimea, Sudan, Iran, Iraq, North Korea, Cuba, and Syria.
How do I register for the course?
Simply click the "Enroll Now" button above. You may be prompted to first register for an MIT xPRO account if you do not have one already. Complete this process, then continue with the enrollment process.
How do I register a group of participants?
For a group of 5 or more individuals, you can pay via invoice. To be invoiced, please email mitxpro@mit.edu with the number of individuals in your group, and instructions to register will be provided. Please note that our payment terms are net zero, and all invoices must be paid prior to the course start date. Failure to remit payment before the course begins will result in removal from the course. No extensions or exceptions will be granted.
What is the registration deadline?
Individual registrations must be completed by February 4, 2019.
How should I pay?
Individual registrants must complete registration and pay online with a valid credit card at the time of registration. MIT xPRO accepts globally recognized major credit or debit cards that have a Visa, MasterCard, Discover, American Express, or Diners Club logo. Invoices will not be generated for individuals or for groups of fewer than 5 people. However, all participants will receive a payment receipt. Payment must be received in full; payment plans are not available.
When will I get access to the course site?
Instructions for accessing the course site will be sent to all paid registrants via email prior to the course launch date. In order to receive these instructions, please add mitxpro@mit.edu to your “trusted senders” list. If you have not received these instructions by the course start date, visit your account dashboard to log in and start the course on the advertised course start date.
I need to cancel my registration. Are there any fees?
Cancellation requests must be submitted to MITxPRO@mit.edu. Cancellation requests received after February 11, 2019 will not be eligible for a refund. To submit your request, please include your full name and order number in your email request. Refunds will be credited to the credit card used when you registered and may take up to two billing cycles to process.
Can I transfer/defer my registration for another session or course?
Admission and fees paid cannot be deferred to a subsequent session; however, you may cancel your registration and reapply at a later date.
Can someone else attend in my place?
We cannot accommodate any substitution requests at this time. Please review the time commitment section and course schedule.
COURSE QUESTIONS
How do I know if this course is right for me?
Carefully review the course description page, which includes a description of course content, objectives, and target audience, and any required prerequisites.
Are there prerequisites or advance reading materials?
The course is open to any interested participant. No advance reading is required. The ability to write code and prior programming experience are not requirements. Since this is not an introductory course, the faculty strongly recommends that participants have substantial background knowledge of statistical techniques and data calculations or quantitative methods of data research.
For participants who wish to engage with the optional case study activities, please allow an extra 3+ hours a week. These optional case study tutorials will require some prior knowledge of and experience with the programming language you choose to use for reproducing case study results. Generally, participants with 6 months of experience using R or Python should be successful in going through these exercises. Please note that the optional case study activities are not required and do not count towards your "grade" or earning a certificate of completion. However, there are two compulsory case studies (no coding skills required since the code and instructions are provided).
How long is the course?
The course is held over seven weeks. Lectures are pre-taped and you can follow along when you find it convenient, as long as you finish all required assignments by March 25, 2019. You may complete all assignments before the due date; however, you may find it more beneficial to adhere to a weekly schedule so you can stay up to date with the discussion forums.
How long will the course material be available online?
The materials will be available to registered and paid participants until September 25, 2019. No extensions may be granted.
What reference materials will be available at the end of the course?
Participants will have 90-day access to the archived course (includes videos, discussion boards, content, and resources).
What materials will participants keep at the end of the course?
Participants will take away program materials, and materials presented in the Resources page, including downloadable case study activities for you to work on in your spare time during or after the course.
Will I receive a Certificate of completion?
Participants who successfully complete the course and all assessments will receive a Certificate in Data Science from MIT. This course does not carry MIT credits or grades; however, a 60% pass rate is required to receive the certificate.
Will I receive MIT credits?
This course does not carry MIT credits. MIT xPRO offers non-credit/non-degree professional programs for a global audience. Participants may not imply or state in any manner, written or oral, that MIT or MIT xPRO is granting academic credit for enrollment in this professional course. Letter grades are not awarded for this course.
Will I earn Continuing Education Units (CEUs)?
Course participants who successfully complete all course requirements are eligible to receive 1.8 Continuing Education Units (CEUs) from MIT. CEUs may not be applied toward any MIT undergraduate or graduate level course.
After I complete this course, will I be an MIT alum?
Participants who successfully complete this course are considered MIT xPRO Alumni. Only those who complete an undergraduate or graduate degree are considered MIT alumni.
Are video captions available?
Each video for this course has been transcribed and the text can be found on the right side of the video when the captions function is turned on. Synchronized transcripts allow students to follow along with the video and navigate to a specific section of the video by clicking the transcript text. Students can use transcripts of media-based learning materials for study and review. In addition, we include a complete course transcript in a single PDF file that allows for easy reference.
I have never taken a course on the edX platform before. What can I do to prepare?
Prior to the first day of class, participants can take a demonstration course on edx.org that was built specifically to help students become more familiar with taking a course on the edX platform.