COMPSCI 590S Project 1 – Multithreaded Wordcount

Due Thursday, September 22

Wordcount is famously the “Hello, world!” of many data science platforms (e.g., MapReduce and Spark). Your first project is to implement wordcount directly in a standard programming language: Java. We will build on this project to give you a visceral understanding of what these platforms are abstracting away, as well as what the costs of this abstraction may be.

Your wordcount program should take multiple arguments. The first is the number of threads to create, and the rest are an arbitrary number of file names.

For example, you should be able to invoke your program as follows:

% java -cp . wordcount 2 foo.txt bar.txt baz.txt
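The argument layout above could be parsed roughly as follows. This is only a sketch: the class name ArgsSketch, the Config holder, and the parse method are hypothetical names chosen for illustration, and rejecting a thread count below 1 is an assumption that the assignment does not state.

```java
import java.util.Arrays;
import java.util.List;

public class ArgsSketch {

    // Hypothetical holder for the parsed command line.
    public static final class Config {
        public final int numThreads;
        public final List<String> fileNames;

        Config(int numThreads, List<String> fileNames) {
            this.numThreads = numThreads;
            this.fileNames = fileNames;
        }
    }

    // First argument: thread count. Remaining arguments: file names.
    public static Config parse(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("usage: wordcount <threads> [file ...]");
        }
        int numThreads = Integer.parseInt(args[0]);
        if (numThreads < 1) { // assumption: at least one worker thread is required
            throw new IllegalArgumentException("thread count must be at least 1");
        }
        List<String> files = Arrays.asList(args).subList(1, args.length);
        return new Config(numThreads, files);
    }

    public static void main(String[] args) {
        Config c = parse(args);
        System.out.println(c.numThreads + " threads, files: " + c.fileNames);
    }
}
```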

Your program should place these file names in a queue protected by a single lock. It should launch the number of threads specified by the first argument. Each thread should pull one file name off the queue and perform its own word count in a local map. Once it finishes a file, it should attempt to get another file to process. When the queue is empty, the thread should (safely!) merge its local word count into a global word count map. Finally, your program should print the word counts in descending order of frequency (most frequent word first), with each line listing a word and its frequency. The output should look exactly like the output of the Python version of wordcount that I have included in the repository (see https://github.com/emeryberger/COMPSCI590S/tree/master/projects/project1, which also has two sample files for testing).
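The structure described above (a locked queue of file names, per-thread local maps, and one guarded merge into a global map) might be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the class and method names are hypothetical, words are split on whitespace, ties are broken alphabetically, and the printed line format is a placeholder. The tokenization, tie order, and output format that actually count are those of the Python version in the repository, which is not reproduced here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class WordCountSketch {

    // Counts the words in all files using numThreads worker threads.
    public static Map<String, Integer> countAll(List<String> fileNames, int numThreads) {
        final Queue<String> queue = new ArrayDeque<>(fileNames); // guarded by queueLock
        final Object queueLock = new Object();
        final Map<String, Integer> global = new HashMap<>();     // guarded by globalLock
        final Object globalLock = new Object();

        Runnable worker = () -> {
            Map<String, Integer> local = new HashMap<>(); // private to this thread, no lock needed
            while (true) {
                String name;
                synchronized (queueLock) { // hold the lock only while dequeuing
                    name = queue.poll();   // null once the queue is empty
                }
                if (name == null) break;
                countFile(name, local);
            }
            synchronized (globalLock) {    // merge once, after this thread has no work left
                for (Map.Entry<String, Integer> e : local.entrySet()) {
                    global.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        };

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < numThreads; i++) {
            Thread t = new Thread(worker);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // join also makes the workers' merged results visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return global;
    }

    // Adds the words of one file to counts. Assumption: words are whitespace-separated.
    static void countFile(String name, Map<String, Integer> counts) {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(name))) {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            System.err.println("skipping " + name + ": " + e.getMessage());
        }
    }

    // Most frequent word first. Assumption: ties are broken alphabetically.
    public static List<Map.Entry<String, Integer>> sortedByFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        int numThreads = Integer.parseInt(args[0]);
        List<String> files = Arrays.asList(args).subList(1, args.length);
        for (Map.Entry<String, Integer> e : sortedByFrequency(countAll(files, numThreads))) {
            System.out.println(e.getKey() + " " + e.getValue()); // placeholder format
        }
    }
}
```

Keeping each thread's counts in a local map means the global lock is taken once per thread rather than once per word, which is the main reason for the local-then-merge design.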

You need to accept the invitation link to set up your GitHub repository for this assignment. Make sure to include your .java file named “wordcount” (along with any other required .java files). You should make frequent commits to this repository. Do not simply dump code into the repository before submission, or I will assume that you are plagiarizing code. We will take a snapshot of all repositories on the due date.