
MapReduce: Simplified Data Processing on Large Clusters

Author(s): Jeffrey Dean and Sanjay Ghemawat

Date: 2004

Notes

  • MapReduce
    • “Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
    • Used for processing large data sets
    • Programs can be parallelized and executed on large clusters of machines, making it highly scalable
  • Drawbacks
    • The distributed nature can hurt performance when there is significant network latency between machines, since intermediate data must be shuffled from map workers to reduce workers over the network
  • Examples
    • Distributed Grep - The map function emits a line if it matches the pattern; the reduce function is the identity and simply copies the intermediate data to the output
    • Count of URL Access Frequency - The map function processes logs of web page requests and emits (URL, 1) pairs; the reduce function adds together all values for the same URL and emits a (URL, total count) pair
  • In Practice
    • A user writes both the map and the reduce function for their dataset
    • The master splits the input data into M pieces for the map tasks and partitions the intermediate key space into R pieces for the reduce tasks, then assigns tasks to idle workers
    • Map workers perform the map function and then store the output on their local disks
    • Reduce workers reach out to the correct map workers (orchestrated by the master program) to get the data they need to perform the reduce function
    • Reduce workers write their results to R output files; once all tasks complete, the master wakes the user program, which can then read the output
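To make the flow above concrete, here is a minimal single-process sketch of the programming model, using the URL access frequency example. This is not the distributed implementation from the paper — the names `map_fn`, `reduce_fn`, and `run_mapreduce` are my own, and the "shuffle" step that groups intermediate values by key (done here with a dictionary) is what the real framework does across machines:

```python
from collections import defaultdict

def map_fn(log_line):
    # Map: emit an intermediate (URL, 1) pair for each logged request.
    url = log_line.split()[0]
    yield (url, 1)

def reduce_fn(url, counts):
    # Reduce: add together all values for the same URL.
    yield (url, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key
    # (the framework's job in a real cluster).
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key, values in sorted(groups.items()):
        output.extend(reduce_fn(key, values))
    return output

logs = ["/index.html GET", "/about.html GET", "/index.html GET"]
print(run_mapreduce(logs, map_fn, reduce_fn))
# → [('/about.html', 1), ('/index.html', 2)]
```

Swapping in a different `map_fn`/`reduce_fn` pair gives you distributed grep, inverted indexes, and the paper's other examples — the grouping-by-key machinery stays the same.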
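The reason a reduce worker can compute a complete total for each URL is that every intermediate pair with the same key is routed to the same one of the R reduce tasks. The paper does this with a partitioning function, hash(key) mod R by default. A small sketch (using CRC32 instead of Python's per-process-randomized `hash`, purely for determinism):

```python
import zlib

R = 4  # number of reduce tasks / output files (illustrative)

def partition(key: str, R: int) -> int:
    # Stand-in for the paper's default hash(key) mod R partitioner.
    return zlib.crc32(key.encode()) % R

# The same key always lands in the same partition, so one reduce
# worker sees every intermediate value for that key.
assert partition("/index.html", R) == partition("/index.html", R)
```

Users can supply their own partitioning function when they need a different grouping — the paper's example is partitioning by hostname so all URLs from one site end up in the same output file.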

Resources
