Professor Emery Berger
- class meetings: Tuesday/Thursday 2:30pm-3:45pm, Integrated Science Building (ISB) 221
- TA: Anna Deng (adeng@cs.umass.edu); office hours: Mondays 4-5pm, CS Building 207
- paper review site: http://systems-for-data-science.cs.umass.edu/
- Piazza page: https://piazza.com/umass/fall2017/compsci590s/home
- course GitHub site (with lecture notes): https://github.com/emeryberger/COMPSCI590S
Description
In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including map-reduce and graph analytics systems such as (among others) Hadoop and Apache Spark.
This course aims to gather and unify content that underpins the design and implementation of data science platforms, and introduce content specific to that area. This information will be directly useful to MS students, who will learn the theory and practice behind data science platforms and how to deliver high-performance data analytics.
Requirements
This course is an MS-level Computer Science course, though it is also open to senior undergraduates majoring in Computer Science as well as PhD students. A solid systems background and substantial programming experience are required. Programming languages to be used will definitely include Java and may include C++, Python, and Scala. There will be a midterm exam, a final exam, and course projects throughout the semester. Students will also be expected to read and write reviews of papers (see below) and “scribe” lecture notes.
Course grades will be distributed as follows (subject to change):
- Reviews and class participation: 10%
- Midterm: 20%
- Final exam: 30%
- Projects: 40%
Because this is an emerging topic, there is no textbook. Instead, we will read technical papers. These will be posted here: http://systems-for-data-science.cs.umass.edu/
(Note that this site is only open to students taking the course.)
Grades will be based on in-class participation, reviews, projects, and exams.
You must submit your reviews via the review submission site before each class, by 12 p.m. (noon).
You will be expected to scribe at least one lecture’s notes. Here is an example to use as a template (in LaTeX).
Schedule
Note: This schedule is tentative (under construction!) and subject to change, especially early in the semester.
- September 5 – Course introduction & overview: concurrency, parallelism, locality
- September 7 – Parallelism: Threads, message-passing
- September 12 – Parallelism, continued
paper: Amdahl’s Law, Gustafson’s Law, and Discussion
- September 14 – Fault-tolerance
paper: Why Do Computers Stop and What Can Be Done About It?
Note: September 18 = last day to drop with no record
- September 19 – Large-Scale Analytics: MapReduce / Hadoop
Project 1 assigned
paper: MapReduce: Simplified Data Processing on Large Clusters
- September 21 – no class
- September 26 – Distributed computing, Cloud Computing, “Big Data”
papers: MapReduce and Parallel DBMSs: Friends or Foes?, MapReduce: A Flexible Data Processing Tool
- September 28 – Databases, continued: SQL, database architectures, optimization, consistency
- [Oct 3/Oct 5 paper schedule to be adjusted]
- October 3 – “Databases”: NoSQL, key-value stores, consistent hashing (Redis, MongoDB)
- October 5 – Optimizing MapReduce: FlumeJava
paper: FlumeJava: Easy, Efficient Data-Parallel Pipelines
- October 10 – no class (UMass Monday)
- Project 1 due
- October 12: Storage: Google File System / HDFS
paper: The Google File System
Note: October 16 = last day to drop with DR (graduate)
- October 17 – TBD
- October 19: Midterm
Note: Last day to drop with W
- October 24 – no class
- October 26 – no class
- October 31 – Graph Analytics: Pregel: A System for Large-Scale Graph Processing
- November 2 – BigTable
paper: Bigtable: A Distributed Storage System for Structured Data
- November 7 – Parallel Processing: Spark
paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- November 9 – Performance: Making Sense
paper: Making Sense of Performance in Data Analytics Frameworks
- November 14 – Performance, Scalability
paper: Scalability! But at what COST?
- November 16 – Distributed Shared Memory
- November 21 – no class (Thanksgiving)
- November 23 – no class (Thanksgiving)
- November 24 – Project 2 due
- November 28 – Data Warehousing
paper: Hive – A Warehousing Solution Over a Map-Reduce Framework
- November 30 – Hybrid Systems (SQL + Big Data)
paper: Spark SQL: Relational Data Processing in Spark
- December 5 – Large-scale Key-Value stores
paper: Dynamo: Amazon’s Highly Available Key-value Store
- December 7 – Intro to machine learning systems
paper: TensorFlow: A System for Large-Scale Machine Learning
- December 12 – TensorFlow, continued
- December 14: Final Exam
Plagiarism Policy
All projects in this course are to be done by you / your group. Violation will result in an automatic zero on the project in question and initiation of the formal procedures of the University; in practice, this results in an F in the course at a minimum. We use an automated program and manual checks to correlate projects with each other and with prior solutions. At the same time, we encourage students to help each other learn the course material. As in most courses, there is a boundary separating these two situations. You may give or receive help on any of the concepts covered in lecture or discussion and on the specifics of programming language syntax, for example.
You are allowed to consult with other students in the current class to help you understand the project specification (i.e., the problem definition). However, you may not collaborate in any way when constructing your solution: the solution to the project must be generated by you or your group working alone. You are not allowed to work out the programming details of the problems with anyone or to collaborate to the extent that your programs are identifiably similar. You are not allowed to look at or in any way derive advantage from the existence of project specifications or solutions prepared elsewhere.
If you have any questions as to what constitutes unacceptable collaboration, please talk to the instructor right away. You are expected to exercise reasonable precautions in protecting your own work. Don’t let other students borrow your account or computer, don’t place your program in a publicly accessible directory or website, and take care when discarding printouts. Publicly posting your code anywhere (e.g., in a public GitHub repo) constitutes a violation of academic honesty.
Accommodation Statement
The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.