
MapReduce: Simplified Data Processing on Large Clusters

Author(s): Jeffrey Dean and Sanjay Ghemawat

Date: 2004

Notes

  • MapReduce
    • “Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
    • Used for processing large data sets
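    • Programs written in this style are automatically parallelized and executed on large clusters of commodity machines; the runtime handles partitioning the input, scheduling, and machine failures, so the user does not have to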
  • Drawbacks
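    • Intermediate data has to move between machines, so a slow network between them hurts performance; the paper treats network bandwidth as a scarce resource and tries to schedule map tasks on or near the machines that already hold the input data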
  • Examples
    • Distributed Grep - The map function emits a line if it matches the supplied pattern; the reduce function is an identity function that copies the intermediate data to the output
    • Count of URL Access Frequency - The map function processes logs of web page requests and outputs a (URL, 1) pair for each request; the reduce function adds together all values for the same URL and emits a (URL, total count) pair
  • In Practice
    • A user writes both the map and the reduce function for their dataset
    • The input is split into M pieces (one map task each) and the intermediate key space is partitioned into R pieces (one reduce task each); the master assigns these tasks to idle workers
    • Map workers run the map function on their input split and store the output on their local disks, partitioned into R regions
    • The master tells reduce workers where that output lives; each reduce worker reads its region from every map worker, groups the values by key, and runs the reduce function
    • Each reduce task writes its result to its own output file, so there are R output files; once all tasks have finished, the master wakes up the user program, which can then read those files
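The two examples and the execution flow above can be sketched in a single process. This is only a toy illustration, not the paper's implementation: there is no master, no workers, and no real disk or network, and every name here (`grep_map`, `url_map`, `run_mapreduce`, the `"<method> <url>"` log format, the default `"error"` pattern) is a hypothetical choice made for the sketch.

```python
import re
from collections import defaultdict


def grep_map(key, line, pattern=r"error"):
    """Distributed grep: emit the line if it matches the pattern (pattern is an assumption)."""
    if re.search(pattern, line):
        yield line, ""


def grep_reduce(line, values):
    """Identity reduce: copy each intermediate record to the output."""
    for _ in values:
        yield line


def url_map(key, log_line):
    """URL access count: assumes each log line looks like '<method> <url>'."""
    url = log_line.split()[1]
    yield url, 1


def url_reduce(url, counts):
    """Add together all values emitted for the same URL."""
    yield url, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    """Toy single-process stand-in for the flow in the notes (no fault tolerance)."""
    # Split the input into M pieces, one per map task.
    splits = [records[i::M] for i in range(M)]

    # Each map task partitions its output into R regions.
    # The in-memory dicts stand in for files on a map worker's local disk.
    map_outputs = []
    for split in splits:
        regions = [defaultdict(list) for _ in range(R)]
        for key, value in split:
            for k, v in map_fn(key, value):
                # hash(key) mod R; Python salts str hashes per process,
                # so which region a key lands in varies between runs.
                regions[hash(k) % R][k].append(v)
        map_outputs.append(regions)

    # Reduce task r reads region r from every map task, groups values by key,
    # processes keys in sorted order, and produces one "output file".
    output_files = []
    for r in range(R):
        grouped = defaultdict(list)
        for regions in map_outputs:
            for k, values in regions[r].items():
                grouped[k].extend(values)
        out = []
        for k in sorted(grouped):
            out.extend(reduce_fn(k, grouped[k]))
        output_files.append(out)
    return output_files


# Usage: count URL accesses, then merge the R output files.
logs = [(0, "GET /a"), (1, "GET /b"), (2, "GET /a"), (3, "POST /a")]
files = run_mapreduce(logs, url_map, url_reduce)
counts = dict(pair for f in files for pair in f)
print(sorted(counts.items()))  # [('/a', 3), ('/b', 1)]

# Usage: grep for lines containing "error".
lines = [(0, "ok"), (1, "error: disk"), (2, "fine"), (3, "fatal error")]
grep_files = run_mapreduce(lines, grep_map, grep_reduce)
matches = sorted(line for f in grep_files for line in f)
print(matches)  # ['error: disk', 'fatal error']
```

The user-supplied part is only the two small functions per job; splitting, partitioning by key, and grouping all live in `run_mapreduce`, which is the division of labor the notes describe.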
