Systems for Data Science (COMPSCI 590S)

Professor Emery Berger

Description

In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including map-reduce and graph analytics systems such as Hadoop and Apache Spark (among others).

This course aims to gather and unify content that underpins the design and implementation of data science platforms, and introduce content specific to that area. This information will be directly useful to MS students, who will learn the theory and practice behind data science platforms and how to deliver high-performance data analytics.

Requirements

This course is an MS-level Computer Science course, though it is also open to senior undergraduates majoring in Computer Science as well as PhD students. A solid systems background and substantial programming experience are required. Programming languages to be used will definitely include Java and may include C++, Python, and Scala. There will be a midterm exam, a final exam, and course projects throughout the semester. Students will also be expected to read and write reviews of papers (see below) and “scribe” lecture notes.

Course grades will be distributed as follows (subject to change):

  • Reviews and class participation: 10%
  • Midterm: 20%
  • Final exam: 30%
  • Projects: 40%
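As a rough illustration of how the weights above combine, here is a minimal sketch of a weighted course score. The component names, the 0-100 scale for each component, and the sample scores are assumptions made for this example only; the weights are the ones listed above and remain subject to change.

```python
# Weights (in percent) taken from the list above; stated as subject to change.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "reviews_participation": 10,
    "midterm": 20,
    "final_exam": 30,
    "projects": 40,
}


def course_score(scores):
    """Return the weighted score (0-100), assuming each component is scored 0-100."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    # Weighted sum of component scores, divided by 100 to undo the percent weights.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up sample scores, for illustration only.
example = {
    "reviews_participation": 90,
    "midterm": 80,
    "final_exam": 70,
    "projects": 85,
}
print(course_score(example))
```

With these made-up scores, the weighted sum is (10·90 + 20·80 + 30·70 + 40·85) / 100 = 80.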

Because this is an emerging topic, there is no textbook. Instead, we will read technical papers. These will be posted here: http://systems-for-data-science.cs.umass.edu/

(Note that this site is only open to students taking the course.)

Grades will be based on in-class participation, reviews, projects, and exams.
You must submit your reviews via the review submission site before each class, by 12 p.m. (noon).

You will be expected to scribe at least one lecture’s notes. Here is an example to use as a template (in LaTeX).

Schedule

Note: This schedule is tentative (under construction!) and subject to change, especially early in the semester.

  1. September 5 – Course introduction & overview: concurrency, parallelism, locality
  2. September 7 – Parallelism: Threads, message-passing
  3. September 12 – Parallelism, continued
    paper: Amdahl’s Law, Gustafson’s Law, and Discussion
  4. September 14 – Fault-tolerance
    paper: Why Do Computers Stop and What Can Be Done About It?
    Note: September 18 = last day to drop with no record
  5. September 19 – Large-Scale Analytics: MapReduce / Hadoop
    Project 1 assigned
    paper: MapReduce: Simplified Data Processing on Large Clusters
  6. September 21 – no class
  7. September 26 – Distributed computing, Cloud Computing, “Big Data”
    papers: MapReduce and Parallel DBMSs: Friends or Foes?; MapReduce: A Flexible Data Processing Tool
  8. September 28 – Databases, continued: SQL, database architectures, optimization, consistency
  9. [Oct 3/Oct 5 paper schedule to be adjusted]
  10. October 3 – “Databases”: NoSQL, key-value stores, consistent hashing (Redis, MongoDB)
  11. October 5 – Optimizing MapReduce: FlumeJava
    paper: FlumeJava: Easy, Efficient Data-Parallel Pipelines
  12. October 10 – no class (UMass Monday)
  13. Project 1 due
  14. October 12 – Storage: Google File System / HDFS
    paper: The Google File System
    Note: October 16 = last day to drop with DR (graduate)
  15. October 17 – TBD
  16. October 19 – Midterm
    Note: Last day to drop with W
  17. October 24 – no class
  18. October 26 – no class
  19. October 31 – Graph Analytics
    paper: Pregel: A System for Large-Scale Graph Processing
  20. November 2 – BigTable
    paper: Bigtable: A Distributed Storage System for Structured Data
  21. November 7 – Parallel Processing: Spark
    paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
  22. November 9 – Performance: Making Sense
    paper: Making Sense of Performance in Data Analytics Frameworks
  23. November 14 – Performance, Scalability
    paper: Scalability! But at what COST?
  24. November 16 – Distributed Shared Memory
    paper: Latency-Tolerant Software Distributed Shared Memory
  25. November 21 – no class (Thanksgiving)
    November 23 – no class (Thanksgiving)
    November 24 – Project 2 due
  26. November 28 – Data Warehousing
    paper: Hive – A Warehousing Solution Over a Map-Reduce Framework
  27. November 30 – Hybrid Systems (SQL + Big Data)
    paper: Spark SQL: Relational Data Processing in Spark
  28. December 5 – Large-scale Key-Value stores
    paper: Dynamo: Amazon’s Highly Available Key-value Store
  29. December 7 – Intro to machine learning systems
    paper: TensorFlow: A System for Large-Scale Machine Learning
  30. December 12 – TensorFlow, continued
    December 14 – Final Exam

Plagiarism Policy

All projects in this course are to be done by you or your group. Violations will result in an automatic zero on the project in question and initiation of the formal procedures of the University; in practice, this results in an F in the course at a minimum. We use an automated program and manual checks to correlate projects with each other and with prior solutions. At the same time, we encourage students to help each other learn the course material. As in most courses, there is a boundary separating these two situations. For example, you may give or receive help on any of the concepts covered in lecture or discussion and on the specifics of programming language syntax.

You are allowed to consult with other students in the current class to help you understand the project specification (i.e., the problem definition). However, you may not collaborate in any way when constructing your solution: the solution to the project must be generated by you or your group working alone. You are not allowed to work out the programming details of the problems with anyone or to collaborate to the extent that your programs are identifiably similar. You are not allowed to look at or in any way derive advantage from the existence of project specifications or solutions prepared elsewhere.

If you have any questions as to what constitutes unacceptable collaboration, please talk to the instructor right away. You are expected to exercise reasonable precautions in protecting your own work. Don’t let other students borrow your account or computer, don’t place your program in a publicly accessible directory or website, and take care when discarding printouts. Publicly posting your code anywhere (e.g., in a public GitHub repo) constitutes a violation of academic honesty.

Accommodation Statement

The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.