I'm taking the May Day holiday as an opportunity to thoroughly analyze one of the most famous papers in distributed systems: "MapReduce: Simplified Data Processing on Large Clusters." This post is deeply influenced by the [article](https://www.qtmuniao.com/2019/04/30/map-reduce/) from "Muniao's Notes," so it is essentially a combination of the paper and that article, plus my own thoughts.

![MR Execution Flow](The-Whole-Arch.png)

I've always heard senior colleagues talk about how powerful MapReduce distributed computing is, so today is the chance to dive in!

## Introduction

In the paper, MapReduce is first of all a concept. Depending on context, it can also mean a programming model, as well as a distributed system implementation that supports that model. The referenced article explains this well:

> MapReduce is a concept proposed in Google's 2004 paper (Google internally wrote the first version starting in 2003).
>
> In Google's context, MapReduce is both a programming model and a distributed system implementation supporting that model. Its introduction enables developers without distributed systems background to leverage large-scale clusters for high-throughput processing of massive data with relative ease.

That article also has a sentence summarizing the problem-solving approach behind the technology:

**Finding pain points in requirements (such as how to maintain, update, and rank massive indexes), performing high-level abstractions of key processing flows (sharding Map, on-demand Reduce), for efficient system implementation (so-called tailored solutions).** _Among these, finding an appropriate computational abstraction is the most difficult part, requiring both intuitive understanding of requirements and extremely high computer science literacy._

The quotes above are from the referenced article in "Muniao's Notes." Returning to the paper, on the first page Google's engineers explain clearly what MapReduce is.
> As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.

In other words, the abstraction lets users express the computation itself while the library hides parallelization, fault tolerance, data distribution, and load balancing.

> This abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.
>
> We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.
>
> Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

That is: map runs on every logical record to produce intermediate key/value pairs, reduce combines all values that share a key, and because both are user-specified functions in a functional model, large computations are easy to parallelize.

I find the last part very interesting:

> use re-execution as the primary mechanism for fault tolerance.

Fault tolerance relies mainly on "re-execution": when a task fails, the system simply runs it again.

The end of the paper's Introduction outlines the remaining sections:

1. Section 1 is the Introduction above
2. Section 2 describes the basic programming model and gives several examples, [Programming Model](#programming-model)
3. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment, [Implementation](#implementation)
4. Section 4 describes several refinements of the programming model that we have found useful
5. Section 5 has performance measurements of our implementation for a variety of tasks
6. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system
7. Section 7 discusses related and future work

## Programming Model

MapReduce, simply put, consists of two user-written functions: the map function and the reduce function. The map function receives an input key/value pair and generates a set of intermediate key/value pairs. The MapReduce library then groups all intermediate values associated with the same intermediate key and passes them to the reduce function, which merges them into a possibly smaller set of values. Conceptually, the grouped intermediate data looks like a map whose key is an ordinary key and whose value is a list of strings.

### Example

Below is the classic pseudo-code from the original paper:

```java
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
```

This is the classic MapReduce word count, one of the most common examples of the programming model. The map function's inputs are:

- key: document name
- value: complete document content (text string)

EmitIntermediate is the function the MapReduce framework provides for outputting intermediate key/value pairs. Each call produces one pair, (word, "1"), indicating that the word appeared once.
For example, processing the document content "hello world hello":

- First word "hello" → EmitIntermediate("hello", "1")
- Second word "world" → EmitIntermediate("world", "1")
- Third word "hello" → EmitIntermediate("hello", "1")

Map phase intermediate results:

```shell
("hello", "1")
("world", "1")
("hello", "1")
```

### Shuffle Stage (Automatically Completed by Framework)

Between Map and Reduce, the MapReduce framework automatically performs the Shuffle operation:

1. Collect all mapper outputs
2. Sort by key (word)
3. Group all values with the same key together

So the above example after Shuffle:

```shell
("hello", ["1", "1"])
("world", ["1"])
```

### Reduce Function

After Shuffle comes the Reduce function's work. The reduce pseudo-code above simply sums the counts for each word: for ("hello", ["1", "1"]) it emits "2".

### Complete Execution Flow

Here's a larger example showing the entire MapReduce execution flow.

**Assume three documents**:

- document1.txt: "hello world"
- document2.txt: "hello mapreduce"
- document3.txt: "mapreduce world example"

**Map Phase (Parallel Execution)**

Mapper 1 processes document1.txt:

```shell
EmitIntermediate("hello", "1")
EmitIntermediate("world", "1")
```

Mapper 2 processes document2.txt:

```shell
EmitIntermediate("hello", "1")
EmitIntermediate("mapreduce", "1")
```

Mapper 3 processes document3.txt:

```shell
EmitIntermediate("mapreduce", "1")
EmitIntermediate("world", "1")
EmitIntermediate("example", "1")
```

**Shuffle Phase (Automatically Completed by Framework)**:

```shell
("hello", ["1", "1"])
("world", ["1", "1"])
("mapreduce", ["1", "1"])
("example", ["1"])
```

**Reduce Phase (Parallel Execution)**:

Reducer processes "hello":

```shell
result = 0
result += 1   # result = 1
result += 1   # result = 2
Emit("2")     # Output ("hello", "2")
```

Similarly for the other words.

**Final Results**

```shell
("hello", "2")
("world", "2")
("mapreduce", "2")
("example", "1")
```

#### MapReduce Framework's Role

In this process, the MapReduce framework is responsible for:

1. Splitting input data into multiple shards and assigning them to different Mappers
2. Executing multiple Map tasks in parallel
3. Performing the Shuffle operation, reorganizing and sorting intermediate results
4. Executing multiple Reduce tasks in parallel
5. Collecting Reduce outputs (one output file per Reduce task)
6. Handling task failures and retries
7. Optimizing data locality, processing data on the nodes where it resides when possible

This pattern allows developers to focus on business logic (the Map and Reduce functions) without worrying about complex issues like parallelization, distributed computing, and fault tolerance.

### More Examples

Here are more examples from the paper.

**Distributed Grep**

**Working Principle**

In this example:

- Map function checks each line of input text, emitting the line if it matches a specified pattern
- Reduce function is a simple identity function that copies intermediate results directly to the output

**Application Value**

This pattern is well suited to finding lines that match a specific pattern in a large distributed file system. Because every input split can be scanned in parallel, it works well for tasks such as searching TB-scale or even PB-scale log files for a specific error message.

**Count of URL Access Frequency**

**Working Principle**

- **Map function**: Processes web request logs, emitting a `<URL, 1>` key-value pair for each request
- **Reduce function**: Sums all counts for the same URL, outputting `<URL, total count>`

**Application Value**

This is a fundamental operation in web analytics, crucial for understanding website traffic distribution, identifying popular content, and detecting abnormal access patterns. On large websites, daily log data can reach several TB, and MapReduce can handle data at this scale effectively.
**Reverse Web-Link Graph**

**Working Principle**

- **Map function**: Analyzes web page content, outputting `<target, source>` for each link to a target URL found in a source page
- **Reduce function**: Collects all source URLs for a target URL, outputting `<target, list(source)>`

**Application Value**

Reverse link graphs are one of the core data structures of modern search engines, used in the following scenarios:

- Foundation data for PageRank and other web importance algorithms
- Analyzing citation relationships between websites
- Discovering influential content creators
- Providing backlink analysis tools for webmasters

Building a complete reverse link graph of the web is a computation-intensive task, and the MapReduce model suits this naturally parallelizable problem well.

**Term-Vector per Host**

**Working Principle**

- **Map function**: Analyzes document content, extracts the hostname from the document URL, and outputs `<hostname, term vector>`
- **Reduce function**: Merges all term vectors for the same host, discards low-frequency terms, and outputs a final `<hostname, term vector>`

**Application Value**

This analysis is valuable for understanding the content characteristics of a website:

- Can be used for website topic classification
- Helps with search engine optimization
- Content similarity comparison
- Competitor website content analysis
- Foundation data for content recommendation systems

**Inverted Index**

**Working Principle**

- **Map function**: Parses each document and outputs `<word, document ID>` key-value pairs
- **Reduce function**: Receives all document IDs for a given word, sorts them, and outputs a `<word, list(document ID)>` key-value pair

**Application Value**

Inverted indexes are the fundamental data structure of modern search engines, used for:

1. **Full-text search**: Quickly find all documents containing the query terms
2. **Phrase search**: Achieve precise phrase search through positional information (if the index is extended to store word positions)
3. **TF-IDF calculation**: Provide term frequency statistics for information retrieval systems
4. **Keyword highlighting**: Help the frontend display matched text fragments
5. **Relevance ranking**: Provide foundation data for search result ranking

MapReduce is particularly suitable for building inverted indexes because it can process large numbers of documents in parallel and naturally merges the index in the Reduce phase.

**Distributed Sort**

**Working Principle**

- **Map function**: Extracts the key of each record and outputs `<key, record>` key-value pairs
- **Reduce function**: Outputs all received key-value pairs without modification

This seemingly simple example relies on two core features of the MapReduce framework:

1. **Partitioning mechanism**: Ensures records with keys in the same range are sent to the same Reducer
2. **Sorting property**: Ensures the keys received by each Reducer are in order

**MapReduce Framework's Special Contribution**:

In distributed sorting, the MapReduce framework does most of the important work:

1. **Custom Partitioner**:

```go
// Example: range partitioner.
// The split points are hard-coded for three reducers, so numReducers
// is ignored in this illustration.
func RangePartitioner(key string, numReducers int) int {
    // Route each key to a reducer by key range. Because the ranges are
    // ordered, concatenating the reducer outputs yields a global sort.
    if key < "D" {
        return 0
    }
    if key < "N" {
        return 1
    }
    return 2
}
```

2. **Sort Comparator**:

```go
import "strings"

// Define the natural sorting order for keys:
// negative if key1 < key2, zero if equal, positive if key1 > key2.
func KeyComparator(key1, key2 string) int {
    return strings.Compare(key1, key2)
}
```

**Application Value**

Distributed sorting is a fundamental operation in many big data processing workflows:

1. **Data preprocessing**: Preparing large datasets for further analysis
2. **Log analysis**: Sorting massive log records by timestamp
3. **Building indexes**: Creating sorted indexes for databases or search engines
4. **Merging sorted data**: Combining multiple sorted datasets into one
5. **TopN queries**: Quickly finding the top N records for a given metric

**Overall Examples Analysis and Comparison**

The examples demonstrate the versatility and adaptability of the MapReduce model:

1. **Distributed Grep**: The simplest application, essentially using only the Map functionality; suitable for simple filtering operations
2. **URL Access Frequency Count**: A variant of the classic word count that shows MapReduce's strength in statistical aggregation
3. **Reverse Web-Link Graph**: Shows how to use MapReduce to build complex relationship graphs and index structures
4. **Term-Vector per Host**: Combines text analysis with aggregation; suitable for advanced content analysis

## Implementation

Many different implementations of the MapReduce interface are possible. **The right choice depends on the specific environment (specific problems require specific analysis).** For example, one implementation might suit a small shared-memory machine, another a large NUMA multiprocessor, and yet another an even larger cluster of networked machines.

Below is the computing environment widely used at Google:

> This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet.
>
> In our environment:
>
> (1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
>
> (2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
>
> (3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
>
> (4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
>
> (5) Users submit jobs to a scheduling system.
> Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

In short: dual-processor x86 Linux machines with 2-4 GB of memory each; commodity networking at 100 megabits/second or 1 gigabit/second per machine, with much lower average bisection bandwidth; clusters of hundreds or thousands of machines, so failures are common; inexpensive IDE disks attached to each machine and managed by an in-house replicated distributed file system; and a scheduler that maps each job's tasks onto available machines.

### Execution Process

> The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.
>
> Following figure shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure correspond to the numbers in the list below):

![MR Execution Flow](The-Whole-Arch.png)

1. The MapReduce library in the user program first splits the input files into M pieces, each typically 16-64 MB in size, and then starts many copies of the program on the cluster.
2. One copy of the program, the master, is special; the other copies are workers that are assigned tasks by the master.
   There are M map tasks and R reduce tasks to assign, and the master hands them to idle workers.

   > There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

   Here, "idle" describes workers. In Google's MapReduce architecture the whole computation is executed in a distributed fashion by:

   1. **Master (master node)**: A special program copy responsible for task scheduling and for coordinating the entire computation
   2. **Workers (worker nodes)**: The other program copies, responsible for executing the actual computation tasks
   3. **Idle workers**: Worker nodes that are not currently executing any task and are waiting for one

   I hope this clarifies what idle workers are. The part of the workflow that involves idle workers looks like this:

   1. When the computation begins, the system starts multiple program copies, with one as the master node and the others as worker nodes
   2. The master node maintains the state of the entire cluster, including whether each worker node is idle
   3. When the master node detects that a worker node is idle (not executing a task), it selects one of the pending M Map tasks or R Reduce tasks and assigns it to that worker node

3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs from the input data, and passes each pair to the user-defined map function. The intermediate key/value pairs produced by map are buffered in memory.

   > A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

   In fact, Hadoop does exactly this.

4. The buffered intermediate pairs are periodically written to local disk, partitioned into R regions by the partitioning function.
   The locations of these buffered pairs on local disk are passed back to the master, which forwards them to the reduce workers.

   > Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
   >
   > The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

5. When a reduce worker receives these locations from the master, it uses RPC to read the corresponding partition data from the map workers' local disks. Once it has read all of its intermediate data, it sorts the data by key so that all occurrences of the same key are grouped together.

   > When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers.
   >
   > When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
   >
   > The sorting is needed because typically many different keys map to the same reduce task.
   >
   > If the amount of intermediate data is too large to fit in memory, an external sort is used.

   If the intermediate data is too large to fit in memory, an external sort is needed. This is one of the places where performance tuning matters.

   **Steps 4 and 5 together are called shuffle**

6. The reduce worker then iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and its corresponding set of values to the user's reduce function. The reduce function's output is appended to the final output file for this reduce partition.

   > The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function.
   >
   > The output of the Reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks are completed, the master wakes up the user program.
   At this point, the MapReduce call in the user program returns to the user code with the computation finished.

   > When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

### Master Data Structures

This section covers the key data structures maintained by the Master node and its core responsibilities in task coordination. The Master node is the "brain" of the entire MapReduce execution process and maintains the following data:

1. **Task State Records**: For each map and reduce task, the master node records its current state:
   - idle: tasks not yet assigned to a worker
   - in-progress: tasks assigned to a worker but not yet completed
   - completed: finished tasks
2. **Worker Machine Identity**: For every non-idle task, the master node records the identity of the worker machine executing it, which is used for tracking execution and handling failures
3. **Intermediate File Metadata**: For each completed map task, the master stores the locations and sizes of the intermediate result files that task produced

Additionally, Google's MapReduce implementation has job-level encapsulation: each job contains a series of tasks, namely Map Tasks and Reduce Tasks. To maintain metadata for a running job, the master has to save the state of every task and the ID of the machine executing it.

**Master as Information Transfer Channel**

The Master node also serves as the channel that carries intermediate result locations from Map Task output to Reduce Tasks. When a map task completes, it informs the master node of which intermediate files were produced, along with their locations and sizes. This information is crucial for reduce tasks, which need to know where to obtain the data they process.
The master node pushes this information incrementally to the worker nodes executing reduce tasks. When each Map Task ends, it notifies the Master of the locations of its intermediate output, the Master forwards them to the corresponding Reduce Tasks, and each Reduce Task fetches data of the given size from the given location.

**Note: Since Map Task completion times are not uniform, this notification → forwarding → fetching process is incremental. It's reasonable to infer that the sorting of intermediate data on the reduce side should be a continuous merge process, unlikely to be a global sort after all data is in place.** (from Muniao)

> The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks).
>
> The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

### Fault Tolerance

A distributed system that processes large amounts of data must gracefully handle the various failures that occur on its machines. The paper's fault-tolerance discussion has three parts:

1. [Worker Failure](#worker-failure)
2. [Master Failure](#master-failure)
3. [Semantics in the Presence of Failures](#semantics-in-the-presence-of-failures)

#### Worker Failure

The master periodically pings each worker node, and if a worker does not respond within a certain amount of time, the master marks it as failed. Completed reduce tasks do not need to be redone, because their output is already stored in the global file system. Map output, however, lives on the failed machine's local disk and is no longer reachable, so every map task that worker completed, along with any task it had in progress, is reset to the initial idle state and waits to be scheduled on another healthy worker.
These map tasks are then re-executed, their output is stored on the new worker's local disk, and the master notifies all reduce workers of the re-execution. A reduce worker that has already read a map task's output from the failed worker does not need to fetch it again; one that has not reads it from the new worker.

Consider the reduce side when the map worker holding part of its input fails. Say this reduce task is R5, which is reading the output of map tasks 41-51 from Worker-37, and Worker-37 fails while R5 is processing M47. Its countermeasures are as follows:

- **Transmission Interruption Handling**: If R5's connection is interrupted while pulling data from Worker-37, it triggers exception handling

```java
// Simplified pseudo-code; the helper methods and worker names are hypothetical
try {
    fetchMapOutput(worker37, mapTaskId47);
} catch (FetchFailureException e) {
    // Wait for the Master to announce where M47 was re-executed
    Worker newSource = waitForNotification(mapTaskId47);
    // Retry against the new location (Worker-51 in this example)
    fetchMapOutput(newSource, mapTaskId47);
}
```

- **Data Consistency**: R5 discards the incomplete data it partially pulled from Worker-37

```java
// Drop partial data that came from a worker that is no longer the source
if (partialData && dataSource != currentSourceForTask) {
    discardPartialData();
    fetchFromNewSource();
}
```

- **Notification Mechanism**: The paper only says that reduce workers are notified of the re-execution and does not spell out how. Possible ways for the Master to notify Reduce tasks include:

```shell
1. Heartbeat responses include updated Map output location information
2. RPC calls notify state changes
3. Reduce tasks periodically poll Master for latest mapping information
```

#### Master Failure

The paper handles Master failure relatively simply. The master could write periodic checkpoints of its data structures so that a new copy can be started from the last checkpoint, but the implementation described in the paper simply aborts the MapReduce computation when the Master fails, and clients can detect this and retry. This design is based on two considerations:

1. **Single Point Characteristic**: There is only one Master node in the system, so the probability that it fails is relatively low
2. **Simplified Design**: Simple failure handling reduces system complexity

However, in critical production environments, this simple approach obviously cannot meet high availability requirements. As distributed systems have developed, more robust Master failure handling deserves serious consideration.

#### Semantics in the Presence of Failures

MapReduce provides a key promise: **when the user-supplied map and reduce operators are deterministic, the distributed implementation produces the same output as a non-faulting sequential execution of the entire program**. This greatly reduces the complexity of reasoning about distributed programs.

How is this "computational result consistency" achieved?

**Key Mechanism: Atomic Commits**

MapReduce achieves result consistency by committing map and reduce task outputs atomically:

1. **Temporary file strategy**: Each in-progress task writes its output to private temporary files. A reduce task produces one such file, and a map task produces R of them (one per reduce task)
2. **Task completion flow**: When a map task completes, the worker sends the master a message containing the names of its R temporary files. When a reduce task completes, the worker atomically renames its temporary output file to the final output file
3. **Redundant execution handling**: If the master receives a completion message for a map task that is already completed, it ignores the message. If the same reduce task runs on several machines, several rename calls target the same final file, and the atomic rename guarantees that the final file contains the output of exactly one execution

**Core Role of Atomic Operations**

Atomic operations are the foundation of the entire fault tolerance mechanism, at two levels:

1. Master data structure updates
2. File system atomic renaming. For example, when Reduce tasks complete, only one execution instance can successfully rename its output, so the existence of the final file means that one instance has executed successfully

### Locality

A commonly used principle in computer science is **locality of reference** (specifically spatial locality): when a program executes sequentially, after accessing a block of data it is highly likely to next access data physically adjacent to it. This simple observation is the foundation of all cache effectiveness, and it leads to a storage hierarchy that runs from slow to fast, cheap to expensive, and large to small (hard disk → memory → cache → registers). In distributed environments, this hierarchy gains at least one more layer: network I/O.
The paper's Locality section opens with this sentence: "Network bandwidth is a relatively scarce resource in our computing environment." MapReduce therefore takes full advantage of input data locality. Instead of pulling the data to the program, the master **schedules** the program onto the machine that already holds the data (**Moving Computation is Cheaper than Moving Data**). When the input is stored on GFS, it appears as a series of logical blocks, each with several (typically three) physical replicas. For each logical block, the Map task can run on a machine that holds one of its physical replicas (if that is not possible, another replica is tried), which minimizes network transfer, reducing latency and saving bandwidth.

> Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS) is stored on the local disks of the machines that make up our cluster. GFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines. The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data). When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.

### Task Granularity

This section covers the design principle behind task granularity in MapReduce, the factors that influence it, and what the paper recommends in practice.

The paper makes one important point: "**M and R should be much larger than the number of worker machines**."

#### M and R Should Be Much Larger Than Worker Machine Count

1. **Dynamic Load Balancing**

   ```
   Simplified load balancing scenario:
   100 Map tasks, 10 machines
   - Machines 1-9: Each processes 10 tasks, uniform load
   - Machine 10: Slower hardware, only completes 5 tasks
   - Task allocator automatically redistributes the remaining 5 tasks to machines that have already finished theirs
   ```

   When each machine processes many small tasks rather than a single large one, fast machines pick up more tasks and slow machines pick up fewer, so work is naturally distributed according to performance.

2. **Accelerated Failure Recovery**

   ```
   Assumption:
   - 2000 machines, each executing about 100 Map tasks
   - Worker-37 fails after completing 92 Map tasks
   Impact:
   - Coarse-grained design (1 large task per machine): all of Worker-37's work is lost and must be redone by a single machine
   - Fine-grained design: Only the 92 small tasks are re-executed, spread across the other 1999 machines
   - Extra load: 92 / 1999 ≈ 0.05 additional tasks per machine on average, so recovery takes about as long as one small task
   ```

   When a worker fails, the many small tasks it had completed can be redistributed across the rest of the cluster and re-executed in parallel (completed map output lives on the failed worker's local disk, so it has to be recomputed). This greatly shortens recovery.

### Backup Tasks

Google designed this mechanism to solve the "straggler" problem.

> One of the common causes that lengthens the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.

It is an important optimization for dealing with slow tasks.

#### Straggler Problem: Key Challenge in Distributed Systems

"Stragglers" are machines that complete tasks abnormally slowly and thereby delay the completion of the entire MapReduce job. The problem is particularly prominent in large-scale distributed environments.

##### Main Causes

1. **Hardware Issues**:

   ```
   - Disk errors: Correctable errors reduce read speed from 30MB/s to 1MB/s
   - Network problems: Network card failures cause bandwidth reduction
   - CPU or memory failures: Significantly reduced processing capability
   ```

2. **Resource Competition**:

   ```
   - Multi-task scheduling conflicts: Other jobs occupy CPU, memory resources
   - I/O contention: Multiple processes compete for disk or network I/O
   - Memory pressure: Insufficient memory causes frequent page swapping
   ```

3. **Software Issues**:

   ```
   - Configuration errors: Such as Google's processor cache disabled bug (100x performance degradation)
   - GC pauses: Long pauses caused by garbage collection
   - System updates: Background services or updates consuming resources
   ```

##### Impact of Stragglers

In a MapReduce job, completion time is bounded by the last task to finish. When 99% of tasks complete quickly but a few are abnormally slow, the job's completion time is dominated by those slow tasks.

Google designed the backup task mechanism as a simple yet effective strategy that trades a moderate amount of redundant work for shorter overall execution time. When a job is close to completion, the master schedules backup executions of the remaining in-progress tasks, and a task counts as completed as soon as either the primary or the backup execution finishes.

## Refinements

The paper provides several extensions beyond the basic Mapper and Reducer primitives that have since become recognized as standard components: Partitioner, Combiner, and Reader/Writer.

### Partitioning Function

#### Core Concept

The partitioning function is the extension point that connects the Map and Reduce phases: it determines which intermediate key-value pairs are sent to which Reduce task.
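The routing step described above can be sketched in a few lines of Go. This is a minimal illustration, not the paper's implementation: `KeyValue`, `hashPartition`, and `route` are hypothetical names, and FNV-1a is just an assumed choice of hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// KeyValue is one intermediate pair emitted by a map task (illustrative type).
type KeyValue struct {
	Key   string
	Value string
}

// hashPartition is an assumed partitioner: FNV-1a hash of the key, modulo r.
func hashPartition(key string, r int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(r))
}

// route splits one map task's output into r buckets, one per reduce task.
func route(pairs []KeyValue, r int, partition func(string, int) int) [][]KeyValue {
	buckets := make([][]KeyValue, r)
	for _, kv := range pairs {
		i := partition(kv.Key, r)
		buckets[i] = append(buckets[i], kv)
	}
	return buckets
}

func main() {
	pairs := []KeyValue{{"hello", "1"}, {"world", "1"}, {"hello", "1"}}
	for i, b := range route(pairs, 3, hashPartition) {
		fmt.Printf("reduce-%d: %v\n", i, b)
	}
}
```

Because the partitioner depends only on the key, both `hello` pairs land in the same bucket no matter which map task emitted them, which is what lets a single reduce task see every value for a key.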
```go
// Basic definition of partitioning function
type PartitionFunc func(key interface{}, numPartitions int) int
```

Core functions:

- Determines which Reduce task processes each intermediate key-value pair produced by Map
- Controls the number and content organization of final output files
- Affects data distribution and load balancing across the cluster

#### Default Hash Partitioning Mechanism

The paper provides a simple and efficient default partitioning function based on hashing (e.g. `hash(key) mod R`). Hadoop's `HashPartitioner` implements the same idea:

```java
// Default hash partitioning implementation (Hadoop's HashPartitioner)
public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the result is never negative
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
```

Characteristics:

- **Simple and efficient**: Low computational overhead, suitable for most scenarios
- **Relatively balanced**: A hash function tends to spread keys evenly across partitions
- **Deterministic**: The same key always maps to the same partition, ensuring correct aggregation

Data flow diagram:

```
Map Output          Partition Function     Reduce Tasks
[K1:V1, K2:V2]      hash(K) % R         -> Reduce-0
[K3:V3, K4:V4]                          -> Reduce-1
[K5:V5, K6:V6]                          -> Reduce-2
...                                        ...
```

#### Custom Partitioning Function Use Cases

The paper points out that some scenarios need specific partitioning logic. For example, when the output keys are URLs and all entries for the same host should end up in the same output file, the paper partitions with `hash(Hostname(urlkey)) mod R`.

### Combiner Function

#### Core Concept and Working Principle

Combiner is a key optimization component in the MapReduce framework that **performs partial data aggregation on the Map side**, reducing network transmission volume.
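The map-side aggregation described above can be sketched for the word-count case as follows. This is a minimal illustration under stated assumptions: `Count` and `combine` are hypothetical names, and a real combiner would run inside the framework on each map task's output before it is written to the intermediate files.

```go
package main

import (
	"fmt"
	"sort"
)

// Count is one intermediate <word, count> pair (illustrative type).
type Count struct {
	Word string
	N    int
}

// combine sums the counts per word within a single map task's output.
// Summation is commutative and associative, so pre-aggregating here
// does not change the final result computed by the reducer.
func combine(pairs []Count) []Count {
	sums := make(map[string]int)
	for _, p := range pairs {
		sums[p.Word] += p.N
	}
	out := make([]Count, 0, len(sums))
	for w, n := range sums {
		out = append(out, Count{Word: w, N: n})
	}
	// Sort by word so the output order is deterministic.
	sort.Slice(out, func(i, j int) bool { return out[i].Word < out[j].Word })
	return out
}

func main() {
	mapOutput := []Count{{"hello", 1}, {"world", 1}, {"hello", 1}, {"hello", 1}}
	combined := combine(mapOutput)
	fmt.Println(len(mapOutput), "records before,", len(combined), "after:", combined)
}
```

In this run four map output records shrink to two before anything crosses the network. The reducer then applies the same summation to the combined records coming from all map tasks.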
```
Workflow:
Map output → Combiner local aggregation → Network transmission → Reducer final aggregation
```

Applicable conditions:

- Map output intermediate keys have **lots of duplicates**
- Reduce function has **commutativity and associativity** (such as sum, max)

#### Performance Advantage Example

Using word count as an example:

```
Without Combiner:
Map1 output: <"hello",1>, <"world",1>, <"hello",1>, <"hello",1>  // 4 records transmitted
Map2 output: <"hello",1>, <"hadoop",1>, <"hello",1>              // 3 records transmitted
Total network transmission: 7 records

With Combiner:
Map1 output: <"hello",3>, <"world",1>   // 2 records transmitted
Map2 output: <"hello",2>, <"hadoop",1>  // 2 records transmitted
Total network transmission: 4 records (43% reduction)
```

For data following a Zipf distribution (like word frequency), Combiner can significantly reduce network transmission and improve performance.

#### Difference from Reduce

- Reducer output is written to the final result files
- Combiner output is written to intermediate files, which are then transmitted to the Reducer

Combiner improves efficiency through "pre-aggregation": it reduces the volume of data transmitted and the time spent processing it, and is especially effective for aggregation operations.

### Input and Output Types

The paper's library can read input in several formats: "text" mode treats each line as a key/value pair (the key is the offset in the file, the value is the line contents), another format stores a sequence of key/value pairs sorted by key, and users can add new input types by implementing a simple reader interface. Hadoop exposes the same idea through its `InputFormat` classes:

```
1. TextInputFormat (default)
   - Each line as one record
   - Key: line offset (LongWritable)
   - Value: line content (Text)
   - Smart splitting: ensures splitting at line boundaries
2. KeyValueTextInputFormat
   - Split each line into key-value by delimiter (default Tab)
   - Suitable for: simple structured text data
3. SequenceFileInputFormat
   - Read binary sequence files (key-value pairs)
   - Support compression and splitting
   - Commonly used for passing data between MapReduce jobs
4. DBInputFormat
   - Read records from relational databases
   - Support SQL queries as data source
```

This demonstrates the **flexibility and extensibility** of the framework.

In the Hadoop ecosystem, the input/output interface design makes it possible to handle diverse data sources:

```
- HDFS files
- Local file systems
- S3, Azure Blob and other cloud storage
- HBase, MongoDB and other NoSQL databases
- Kafka streaming data
```

By implementing an appropriate InputFormat/OutputFormat, developers can integrate MapReduce with many data sources and targets, which makes the framework suitable for a wide range of big data processing scenarios.

## Performance

The paper reports two performance tests, each representing a typical big data processing scenario:

**Test Scenario Analysis**

1. **Pattern Search Test**: Search for a specific pattern in about 1TB of data
   - Represents the "extract small amounts of valuable information from large datasets" computation pattern
   - Typical applications include log analysis, anomaly detection, specific record finding, etc.
2. **Large Data Sorting Test**: Sort about 1TB of data
   - Represents the "transform data from one representation to another" computation pattern
   - Typical applications include ETL processes, data preprocessing, data reorganization, etc.

Below, the MapReduce framework is analyzed from five angles:

1. Cluster Configuration
2. Grep
3. Sort
4. Effect of Backup Tasks
5. Machine Failures

### Cluster Configuration Analysis

First, the cluster configuration:

```go
// Cluster configuration overview
type ClusterConfig struct {
	Nodes            int    // About 1800 machines
	CpuPerNode       int    // 2 × 2GHz Intel Xeon per node (with hyperthreading)
	MemoryPerNode    string // 4GB per node (actual usable 2.5-3GB)
	DisksPerNode     int    // 2 × 160GB IDE disks per node
	NetworkBandwidth string // Gigabit Ethernet
	NetworkTopology  string // Two-level tree switching network
	RootBandwidth    string // 100-200Gbps aggregate bandwidth
	Latency          string // <1ms round-trip time between any pair of machines
}
```

**Analysis**:

- This is a cluster balanced between computation and I/O, which suits MapReduce's divide-and-conquer model
- Storage-wise, each node has over 300GB of disk, providing sufficient local storage for TB-level data processing
- Network-wise, the tree topology is simple but may become a bottleneck during the shuffle phase
- The low round-trip time (<1ms) helps the shuffle, where reduce workers fetch data from many map workers
- The cluster scale (about 1800 nodes) enables effective parallelization when processing TB-level data

### Grep

![Figure2 Data Transfer Rate Over Time](Figure2-Data-transfer-rate-over-time.png)

This is a typical "extract small amounts of information from massive data" scenario:

```go
// Grep task configuration
type GrepJobConfig struct {
	InputSize    string // About 1TB (10^10 100-byte records)
	Pattern      string // Three-character pattern (matches 92,337 records)
	InputSplits  int    // M=15000 (about 64MB per block)
	ReduceTasks  int    // R=1 (single output file)
	PeakScanRate string // >30GB/s (with 1764 workers)
	TotalTime    int    // About 150 seconds (including 60 seconds startup overhead)
}
```

**Performance Analysis**:

1. **Scalability Performance**: In Figure 2, the scan rate climbs steadily as more workers are assigned and peaks at over 30GB/s, showing good horizontal scaling for map-intensive tasks
2. **I/O Bound Characteristics**: Grep is essentially I/O-intensive work; the 30GB/s peak across 1764 workers works out to roughly 17MB/s per worker, which the locality optimization lets most workers read from local disk
3. **Optimization Opportunities**: About 60 seconds of startup overhead (40% of total time) is an optimization point. The paper attributes it to propagating the program to all worker machines and to delays interacting with GFS to open the 1000 input files and fetch the location information needed for the locality optimization
4. **R=1 Design**: A single reduce task suits this "filtering" scenario, but it also means final result collection could become a bottleneck (not visible in this example because the output is small)

### Sort

![Figure3 Data Transfer Rates Over Time for Different Executions](Figure3-Data-transfer-rates-over-time-for-diff-exec-of-sort-program.png)

A comprehensive test of the entire MapReduce framework:

```go
// Sort task characteristic analysis
type SortJobAnalysis struct {
	InputSize      string // About 1TB (10^10 100-byte records)
	InputRate      string // Peak 13GB/s (lower than Grep due to writing intermediate data)
	ShufflePattern string // Two-phase pattern, related to reduce task batching
	OutputRate     string // 2-4GB/s (dual replica writes, actual physical writes 4-8GB/s)
	MapTasks       int    // M=15000 (about 64MB per block)
	ReduceTasks    int    // R=4000 (partitioning strategy leverages key distribution knowledge)
	TotalTime      int    // 891 seconds (close to TeraSort benchmark 1057 seconds)
}
```

**Technical Analysis**:

1. **Data Pipeline**: The test shows MapReduce's three overlapping stages: map input finishes within about 200 seconds, shuffling starts as soon as the first map task completes and is done by about 600 seconds, and output writes finish at about 850 seconds
2. **Resource Bottleneck Shifting**:
   - Up to about 200 seconds: Bottleneck in disk I/O and CPU (reading and parsing input)
   - Up to about 600 seconds: Bottleneck shifts to network bandwidth (shuffle)
   - Up to about 850 seconds: Bottleneck in sorting computation and output disk I/O
3. **Locality Optimization Effect**: The input rate (13GB/s) is higher than the shuffle rate mainly because of the locality optimization: most input is read from local disk rather than over the network
4. **Replication Overhead**: The output rate (2-4GB/s) is relatively low mainly because GFS writes two replicas of the output, so actual physical writes are twice this rate

#### Effect of Backup Tasks

Figure 3 also shows:

```go
// Backup task impact analysis
type BackupTaskAnalysis struct {
	WithBackup     int    // Normal execution, total time 891 seconds
	WithoutBackup  int    // 1283 seconds, 44% increase
	StragglerDelay int    // Last 5 reduce tasks took additional 300 seconds
	EfficiencyGain string // Enabling backup tasks cuts elapsed time by about 31% (1283s → 891s)
}
```

**Professional Interpretation**:

1. **Severity of Straggler Problem**: The data shows how severe the "straggler problem" is in distributed systems: just 5 slow tasks increased total time by 44%
2. **Root Cause Analysis**: Stragglers typically stem from:
   - Hardware anomalies (like disk performance degradation, memory errors)
   - Resource competition (like interference from other processes)
   - Data skew (some reduce tasks process significantly more data than others)
3. **Cloud-Native Environment Significance**: In shared-resource cloud environments, straggler problems are more prevalent; backup task mechanisms are key to keeping performance predictable

#### Machine Failures

Figure 3 also shows:

```go
// Fault tolerance capability analysis
type FaultToleranceAnalysis struct {
	NodesKilled     int    // 200 of 1746 worker processes killed (about 11.5%)
	RecoveryPattern string // Brief negative input rate, then quick recovery
	TotalTime       int    // 933 seconds, only 5% increase
	KeyMechanism    string // Automatically detect failures and re-execute tasks
}
```

**Analysis**:

1. **Failure Impact Visualization**: The negative input rate in the graph shows how worker deaths make previously completed map work disappear, so it has to be re-executed
2. **Quick Recovery Principle**:
   - Task state tracking: the master continuously tracks each task's state
   - Heartbeat detection: the master detects worker failures by pinging every worker periodically
   - Task rescheduling: tasks from failed workers are reassigned to healthy ones
   - Redundant execution: the key is that the MapReduce design allows any worker to handle any task
3. **Comparison with Traditional Systems**: A system that has to restart the whole job when a worker dies would lose far more; MapReduce re-executes only the lost tasks, which is why losing about 11% of the workers costs only about 5% in elapsed time

## Experience

This content is excerpted from Jeff Dean and Sanjay Ghemawat's MapReduce paper and covers MapReduce's early development history and its application within Google.

![Table1 MapReduce Jobs Run in August 2004](Table1-MapReduce-jobs-run-in-august-2004.png)

### Technical Development History

The first version of the MapReduce library was developed in February 2003, with major enhancements in August of the same year, including:

- Locality optimization
- Dynamic load balancing of task execution across worker nodes
- Other performance optimizations

### Wide Application Scope

MapReduce gained widespread application within Google, covering multiple domains:

1. Large-scale machine learning problems
2. Clustering problems for Google News and Froogle (early Google Shopping) products
3. Popular query report data extraction (such as Google Zeitgeist)
4. Attribute extraction from large-scale web corpora (such as geographical locations for localized search)
5. Large-scale graph computations

### Explosive Growth

According to Figure 4 of the paper, the number of separate MapReduce programs checked into Google's primary source code repository grew rapidly:

- Early 2003: 0 instances
- End of September 2004: Nearly 900 instances

This rapid growth indicates how quickly MapReduce was adopted within Google.

### Success Factor Analysis

Key factors for MapReduce success:

1. **Simplified Distributed Computing**: Enables developers to write simple programs that run efficiently on thousands of machines
2. **Accelerated Development Cycles**: Significantly shortened development and prototyping cycles
3. **Lowered Technical Barriers**: Allows programmers without distributed/parallel systems experience to easily leverage large-scale computing resources

### Scale and Efficiency Analysis (August 2004 Data)

From the table data, the following insights emerge:

1. **Wide Usage**: 29,423 MapReduce jobs were executed in a single month
2. **High Processing Efficiency**: Average job completion time was 634 seconds (about 10.6 minutes)
3. **Large Computation Scale**:
   - Used the equivalent of 79,186 machine-days of computation time
   - Processed 3,288 TB of input data
   - Generated 758 TB of intermediate data
   - Output 193 TB of results
4. **Task Distribution Characteristics**:
   - Each job used an average of 157 worker machines
   - Average of 1.2 worker deaths per job (failures are routine, and jobs still complete)
   - Average of 3,351 map tasks and 55 reduce tasks per job
5. **Code Reusability**:
   - 395 unique map implementations
   - 269 unique reduce implementations
   - 426 unique map/reduce combinations

## Conclusions

The paper's conclusions explain why this paper is so famous:

### Three Key Success Factors of MapReduce

#### 1. Simple and Easy-to-Use Programming Model

MapReduce's primary success factor is its concise programming interface:

```go
// Users only need to define these two functions, without worrying about distributed system complexity
type KeyValue struct{ Key, Value string }

func Map(key, value string) []KeyValue { /* User-defined mapping logic */ return nil }
func Reduce(key string, values []string) string { /* User-defined reduction logic */ return "" }
```

This design is extremely developer-friendly because it:

- Hides the complex details of parallelization
- Automatically handles fault tolerance
- Builds in locality optimization
- Provides transparent load balancing

This enables even programmers without distributed systems experience to write efficient distributed programs.

#### 2. Powerful Expressiveness

The MapReduce model can easily express many types of computational problems. Within Google it is widely applied to:

- Web search service data generation
- Large-scale sorting
- Data mining
- Machine learning
- Many other systems

This versatility makes MapReduce a fundamental computing framework within Google.

#### 3. Excellent Scalability

The MapReduce implementation can scale to large clusters containing thousands of machines:

```go
// Pseudo-code: MapReduce scheduling process
func Schedule(input []string, mappers int, reducers int) Result {
	// Automatically handles:
	// 1. Task allocation and parallelization
	// 2. Machine failure detection and recovery
	// 3. Data locality optimization
	// 4. Intermediate result management
	return Result{} // Result is a placeholder type for this sketch
}
```

This enables efficient handling of the large-scale computational problems Google encounters, laying the foundation for big data processing.

### Three Key Insights from Research Team

#### 1. Value of Restricted Programming Models

The paper's first lesson is that consciously restricting the programming model yields large system-level advantages:

- Easy parallelization and distributed computing
- Natural implementation of fault tolerance mechanisms
- Reduced development and maintenance costs

This "less is more" philosophy contrasts sharply with other systems attempting to provide completely general parallel programming environments.

#### 2. Network Bandwidth is a Scarce Resource

The research team found that network bandwidth is a precious resource in distributed systems, so many optimizations target reducing network transmission:

- **Locality optimization**: Prioritize reading data from local disks, reducing cross-network data transmission
- **Local intermediate data storage**: Write intermediate results to local disks rather than distributed storage, saving network bandwidth

These designs are particularly important in large-scale clusters, where data transmission can become the system bottleneck.

#### 3. Importance of Redundant Execution

Redundant execution is a key innovation in MapReduce, used for:

- Reducing the impact of slow machines (stragglers)
- Gracefully handling machine failures
- Handling data loss

```go
// Pseudo-code: Redundant task scheduling in MapReduce.
// Task, slowThreshold, and launchDuplicateTask are illustrative names.
// The paper's master launches backups for the remaining in-progress tasks once
// the job is close to completion; the time threshold here is a simplification.
func scheduleBackupTasks(slowTasks []Task) {
	for _, task := range slowTasks {
		if time.Since(task.StartTime) > slowThreshold {
			// Launch backup copy of same task on another machine
			launchDuplicateTask(task)
		}
	}
}
```

This mechanism significantly improves the reliability and performance consistency of large distributed systems.

### Technical Legacy of MapReduce

The paper's conclusions show that MapReduce is not just a technical innovation, but a new paradigm for large-scale data processing:

1. **Architectural Influence**: MapReduce's design philosophy influenced later big data processing frameworks like Hadoop and Spark
2. **Programming Model Innovation**: Proved that simplified programming models can solve complex distributed computing problems
3. **Engineering Practice Revolution**: Changed how large-scale data processing systems are built, from hand-built special-purpose systems to general frameworks
4. **Commercial Value Creation**: Laid the foundation for later big data ecosystems, creating enormous commercial value

In summary, MapReduce solved the core challenges of large-scale distributed data processing through simple yet powerful abstractions, making massive data processing accessible. This is why it achieved tremendous success both within Google and across the industry.