Systems for Data Science (COMPSCI 590S)

Professor Emery Berger

Description

In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These issues include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including map-reduce and graph analytics systems such as Hadoop and Apache Spark, among others.

This course aims to gather and unify content that underpins the design and implementation of data science platforms, and introduce content specific to that area. This information will be directly useful to MS students, who will learn the theory and practice behind data science platforms and how to deliver high-performance data analytics.

Requirements

This course is an MS-level Computer Science course, though it is also open to senior undergraduates majoring in Computer Science as well as PhD students. A solid systems background and substantial programming experience are required. Programming languages to be used will definitely include Java and may include C++, Python, and Scala. There will be a midterm exam, a final exam, and course projects throughout the semester. Students will also be expected to read and write reviews of papers (see below) and “scribe” lecture notes.

Course grades will be distributed as follows (subject to change):

  • Reviews and class participation: 10%
  • Midterm: 20%
  • Final exam: 30%
  • Projects: 40%
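
As a rough illustration of how these weights combine, here is a minimal sketch in Python. The weights are the tentative ones listed above; the function name, the 0-100 score scale, and the sample scores are assumptions made for the example, not the official grading procedure.

```python
# Weights (in percent) copied from the tentative breakdown above; they may change.
WEIGHTS = {
    "reviews_and_participation": 10,
    "midterm": 20,
    "final_exam": 30,
    "projects": 40,
}


def weighted_grade(scores):
    """Combine per-component scores (each assumed to be on a 0-100 scale)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the components in WEIGHTS")
    # Integer percent weights keep the arithmetic exact until the final division.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "reviews_and_participation": 90,
    "midterm": 80,
    "final_exam": 70,
    "projects": 85,
}
example_total = weighted_grade(example_scores)
print(example_total)  # 80.0
```

With these sample scores the projects component contributes 34 of the 80 points, which reflects its 40% weight.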

Because this is an emerging topic, there is no textbook. Instead, we will read technical papers. These will be posted here: http://systems-for-data-science.cs.umass.edu/

(Note that this site is only open to students taking the course.)

Grades will be based on in-class participation, reviews, projects, and exams.
You must submit your reviews via the review submission site before each class, by 12 p.m.

Schedule

Note: This schedule is tentative (under construction!) and subject to change, especially early in the semester.

  1. September 6 – Course introduction & overview: concurrency, parallelism, locality
  2. September 8 – Parallelism: Threads, message-passing
  3. September 13 – Parallelism, continued
    paper: Amdahl’s Law, Gustafson’s Law, and Discussion
  4. September 15 – Fault-tolerance
    paper: Why Do Computers Stop and What Can Be Done About It?
    Project 1 assigned, due September 22
    Note: September 19 = last day to drop with no record
  5. September 20 – Large-Scale Analytics: MapReduce / Hadoop
    paper: MapReduce: Simplified Data Processing on Large Clusters
  6. September 22 – no class
    Project 1 due
  7. September 27 – Distributed computing, Cloud Computing, “Big Data”
    papers: MapReduce and Parallel DBMSs: Friends or Foes?
    MapReduce: A Flexible Data Processing Tool
  8. September 29 – Databases, continued: SQL, database architectures, optimization, consistency
  9. October 4 – Graph Analytics (Pregel, Arabesque): guest lecturer Marco Serafini
    papers: Pregel: A System for Large-Scale Graph Processing
    Arabesque: A System for Distributed Graph Mining
    October 6 – no class
    October 11 – no class (UMass Monday)
  10. October 13 – “Databases”: NoSQL, key-value stores, consistent hashing (Redis, MongoDB)
  11. Note: October 17 = last day to drop with DR (graduate)
  12. October 18 – Optimizing MapReduce: FlumeJava
    paper: FlumeJava: Easy, Efficient Data-Parallel Pipelines
  13. October 20 – Midterm
    Note: Last day to drop with W
  14. October 25 – Storage: Google File System / HDFS
    paper: The Google File System
  15. October 27 – Bigtable
    paper: Bigtable: A Distributed Storage System for Structured Data
  16. November 1 – Parallel Processing: Spark
    paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
    November 3 – no class
  17. November 8 – Midterm review
  18. November 10 – Midterm review
    Performance: Making Sense
    paper: Making Sense of Performance in Data Analytics Frameworks
  19. November 15 – Performance, Scalability
    paper: Scalability! But at what COST?
  20. November 17 – Distributed Shared Memory
    paper: Latency-Tolerant Software Distributed Shared Memory
  21. November 22 – no class (Thanksgiving)
    November 24 – no class (Thanksgiving)
  22. November 29 – Data Warehousing
    paper: Hive – A Warehousing Solution Over a Map-Reduce Framework
  23. December 1 – Hybrid Systems (SQL + Big Data)
    paper: Spark SQL: Relational Data Processing in Spark
  24. December 3 – Large-scale Key-Value stores
    paper: Dynamo: Amazon’s Highly Available Key-value Store
  25. December 6 – continued
  26. December 8 – Intro to machine learning systems
    paper: TensorFlow: A System for Large-Scale Machine Learning
  27. December 13 – TensorFlow
    Project 3 due
    December 21: Final Exam

Plagiarism Policy

All projects in this course are to be done by you / your group. Violation will result in an automatic zero on the project in question and initiation of the formal procedures of the University; in practice, this results in an F in the course at a minimum. We use an automated program and manual checks to correlate projects with each other and with prior solutions. At the same time, we encourage students to help each other learn the course material. As in most courses, there is a boundary separating these two situations. You may give or receive help on any of the concepts covered in lecture or discussion and on the specifics of programming language syntax, for example.

You are allowed to consult with other students in the current class to help you understand the project specification (i.e., the problem definition). However, you may not collaborate in any way when constructing your solution: the solution to the project must be generated by you or your group working alone. You are not allowed to work out the programming details of the problems with anyone or to collaborate to the extent that your programs are identifiably similar. You are not allowed to look at or in any way derive advantage from the existence of project specifications or solutions prepared elsewhere.

If you have any questions as to what constitutes unacceptable collaboration, please talk to the instructor right away. You are expected to exercise reasonable precautions in protecting your own work. Don’t let other students borrow your account or computer, don’t place your program in a publicly accessible directory or website, and take care when discarding printouts.

Accommodation Statement

The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.