Systems for Data Science (COMPSCI 590S)

In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These include locality and data representation, concurrency, distributed databases and systems, and understanding and analyzing performance. We will explore the details of existing and emerging data science platforms, including MapReduce and graph analytics systems such as Hadoop and Apache Spark.

Prerequisites: COMPSCI 311, COMPSCI 345, and COMPSCI 377. 3 credits.
Required Texts: none
Reference Texts:

  • Programming Amazon Web Services, James Murty
  • Apache Spark Graph Processing, Rindra Ramamonjison
  • Hadoop: The Definitive Guide, 4th Edition, Tom White

Course Format:

The course consists of two meetings per week. Each meeting includes a lecture. Readings will be assigned as preparation for each class meeting. Several projects will be assigned during the course. The projects provide students with an opportunity to explore the topics in more depth and in a specialized domain. A midterm exam and a final exam will be given. Grades will be determined by a combination of projects, exam scores, and class participation. 

Course Topics and Schedule (subject to change):

  • Week 1
    • Principles of Locality
    • Memory Hierarchy
    • Caching
  • Week 2
    • Speculative Execution
    • Pre-fetching and Streaming
  • Week 3
    • Memory Management
    • Virtual Memory and Paging
    • Garbage Collection
  • Week 4
    • Locality-Sensitive Data Structures & Algorithms
      • Graphs and Tree Representations
      • Vectors and Matrices
  • Week 5
    • Concurrency
      • Hiding Latency
      • Asynchronous I/O
      • Threads
  • Week 6
    • Multithreaded Programming
    • Programming Models
      • Threads & Actors
  • Week 7
    • Parallelism
      • Vector Operations
      • Models of Parallelism
  • Week 8
    • Distributed Systems
    • Fault Tolerance
    • Transactions
      • ACID vs. BASE
      • Eventual Consistency
  • Week 9
    • Hashing
      • Consistent Hashing
      • Distributed Hash Tables
  • Week 10
    • Cloud Services
      • Amazon EC2 / S3
      • Microsoft Azure
  • Week 11
    • Performance Analysis
      • Single node
      • Concurrent Systems
      • Parallel Applications
      • Distributed Systems
    • Performance Debugging
  • Week 12
    • File Systems and OS Support
    • Google / Hadoop File System
  • Week 13
    • Databases
      • SQL Databases
      • Query Planning and Optimization
      • Optimistic Concurrency
    • NoSQL, a.k.a., Key-Value Stores
      • Redis
      • MapReduce / Hadoop
  • Week 14
    • Large-Scale Graph Analytics
      • Pregel
      • Spark

Accommodation Statement
The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students.  If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course.  If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.

Academic Honesty Statement
Since the integrity of the academic enterprise of any institution of higher education requires honesty in scholarship and research, academic honesty is required of all students at the University of Massachusetts Amherst.  Academic dishonesty is prohibited in all programs of the University.  Academic dishonesty includes but is not limited to: cheating, fabrication, plagiarism, and facilitating dishonesty.  Appropriate sanctions may be imposed on any student who has committed an act of academic dishonesty.  Instructors should take reasonable steps to address academic misconduct.  Any person who has reason to believe that a student has committed academic dishonesty should bring such information to the attention of the appropriate course instructor as soon as possible.  Instances of academic dishonesty not related to a specific course should be brought to the attention of the appropriate department Head or Chair.  Since students are expected to be familiar with this policy and the commonly accepted standards of academic integrity, ignorance of such standards is not normally sufficient evidence of lack of intent (