Professor Emery Berger
- class meetings: Tuesday/Thursday 2:30pm-3:45pm, CS Building Room 142
- office hours: Thursday, 3:45pm-4:45pm, CS 344
- paper review site: http://systems-for-data-science.cs.umass.edu/
- Piazza page: https://piazza.com/umass/fall2016/compsci590s/home
- course GitHub site (with lecture notes): https://github.com/emeryberger/COMPSCI590S
In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including map-reduce and graph analytics systems such as Hadoop and Apache Spark (among others).
This course aims to gather and unify content that underpins the design and implementation of data science platforms, and introduce content specific to that area. This information will be directly useful to MS students, who will learn the theory and practice behind data science platforms and how to deliver high-performance data analytics.
This course is an MS-level Computer Science course, though it is also open to senior undergraduates majoring in Computer Science as well as PhD students. A solid systems background and substantial programming experience are required. Programming languages used will definitely include Java and may include C++, Python, and Scala. There will be a midterm exam, a final exam, and course projects throughout the semester. Students will also be expected to read and write reviews of papers (see below) and “scribe” lecture notes.
Course grades will be distributed as follows (subject to change):
- Reviews and class participation: 10%
- Midterm: 20%
- Final exam: 30%
- Projects: 40%
Because this is an emerging topic, there is no textbook. Instead, we will read technical papers. These will be posted here: http://systems-for-data-science.cs.umass.edu/
(Note that this site is only open to students taking the course.)
Grades will be based on in-class participation, reviews, projects, and exams.
You must submit your reviews via the review submission site before each class, by 12 p.m. (noon).
Note: This schedule is tentative (under construction!) and subject to change, especially early in the semester.
- September 6 – Course introduction & overview: concurrency, parallelism, locality
- September 8 – Parallelism: Threads, message-passing
- September 13 – Parallelism, continued
paper: Amdahl’s Law, Gustafson’s Law, and Discussion
- September 15 – Fault-tolerance
paper: Why Do Computers Stop and What Can Be Done About It?
Project 1 assigned, due September 22
Note: September 19 = last day to drop with no record
- September 20 – Large-Scale Analytics: MapReduce / Hadoop
paper: MapReduce: Simplified Data Processing on Large Clusters
- September 22 – no class
Project 1 due
- September 27 – Distributed computing, Cloud Computing, “Big Data”
papers: MapReduce and Parallel DBMSs: Friends or Foes?, MapReduce: A Flexible Data Processing Tool
- September 29 – Databases, continued: SQL, database architectures, optimization, consistency
- October 4 – Graph Analytics (Pregel, Arabesque): guest lecturer Marco Serafini
papers: Pregel: A System for Large-Scale Graph Processing
Arabesque: A System for Distributed Graph Mining
- October 6 – no class
- October 11 – no class (UMass Monday)
- October 13 – “Databases”: NoSQL, key-value stores, consistent hashing (Redis, MongoDB)
Note: October 17 = last day to drop with DR (graduate)
- October 18 – Optimizing MapReduce: FlumeJava
paper: FlumeJava: Easy, Efficient Data-Parallel Pipelines
- October 20 – Midterm
Note: Last day to drop with W
- October 25 – Storage: Google File System / HDFS
paper: The Google File System
- October 27 – BigTable
paper: Bigtable: A Distributed Storage System for Structured Data
- November 1 – Parallel Processing: Spark
paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- November 3 – no class
- November 8 – Midterm review
- November 10 – Midterm review; Performance: Making Sense
paper: Making Sense of Performance in Data Analytics Frameworks
- November 15 – Performance, Scalability
paper: Scalability! But at what COST?
- November 17 – Distributed Shared Memory
- November 22 – no class (Thanksgiving)
- November 24 – no class (Thanksgiving)
- November 29 – Data Warehousing
paper: Hive – A Warehousing Solution Over a Map-Reduce Framework
- December 1 – Hybrid Systems (SQL + Big Data)
paper: Spark SQL: Relational Data Processing in Spark
- December 3 – Large-scale Key-Value stores
paper: Dynamo: Amazon’s Highly Available Key-value Store
- December 6 – Large-scale Key-Value stores, continued
- December 8 – Intro to machine learning systems
paper: TensorFlow: A System for Large-Scale Machine Learning
- December 13 – TensorFlow
Project 3 due
- December 21 – Final Exam
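As a quick preview of the September 13 readings, Amdahl's and Gustafson's laws each reduce to a one-line formula. The sketch below (in Python; the parameter names `p` and `n` are our choice, not from the readings) shows why the two laws give such different predictions for the same parallel fraction:

```python
# Sketch: speedup predicted by Amdahl's law vs. Gustafson's law.
# p = fraction of the work that is parallelizable, n = number of processors.

def amdahl_speedup(p: float, n: int) -> float:
    """Fixed problem size: speedup is limited by the serial fraction (1 - p)."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p: float, n: int) -> float:
    """Scaled problem size: the parallel part grows with n, the serial part does not."""
    return (1.0 - p) + p * n

# With 95% parallel work, Amdahl's law caps speedup at 1/0.05 = 20x,
# no matter how many processors are added.
print(amdahl_speedup(0.95, 1024))     # ≈ 19.6
print(gustafson_speedup(0.95, 1024))  # ≈ 972.85
```

The contrast is the point of the discussion: Amdahl assumes the problem size is fixed as processors are added, while Gustafson assumes the problem grows to fill the machine.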
All projects in this course are to be done by you / your group. Violation will result in an automatic zero on the project in question and initiation of the formal procedures of the University; in practice, this results in an F in the course at a minimum. We use an automated program and manual checks to correlate projects with each other and with prior solutions. At the same time, we encourage students to help each other learn the course material. As in most courses, there is a boundary separating these two situations. You may give or receive help on any of the concepts covered in lecture or discussion and on the specifics of programming language syntax, for example.
You are allowed to consult with other students in the current class to help you understand the project specification (i.e., the problem definition). However, you may not collaborate in any way when constructing your solution: the solution to the project must be generated by you or your group working alone. You are not allowed to work out the programming details of the problems with anyone or to collaborate to the extent that your programs are identifiably similar. You are not allowed to look at or in any way derive advantage from the existence of project specifications or solutions prepared elsewhere.
If you have any questions as to what constitutes unacceptable collaboration, please talk to the instructor right away. You are expected to exercise reasonable precautions in protecting your own work. Don’t let other students borrow your account or computer, don’t place your program in a publicly accessible directory or website, and take care when discarding printouts.
The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.