The MapReduce paradigm
The MapReduce paradigm was created in 2003 to enable processing of large datasets in a massively parallel manner. The goal of the MapReduce model is to simplify the transformation and analysis of large datasets and to allow developers to focus on algorithms instead of data management. The model allows for simple implementation of data-parallel algorithms. There are several implementations of this model, including Google's original implementation, written in C++, and Apache's Hadoop implementation, written in Java. Both run on large clusters of commodity hardware in a shared-nothing, peer-to-peer environment.
- Map
- The map function, also referred to as the map task, processes a single key/value input pair and produces a set of intermediate key/value pairs.
- Reduce
- The reduce function, also referred to as the reduce task, takes all intermediate key/value pairs produced in the map phase that share the same key and produces zero, one, or more output data items (see the word-count sketch after this list).
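To make the two functions concrete, the following sketch shows the classic word-count example written against Hadoop's Java MapReduce API. The class names (`WordCount`, `WordCountMapper`, `WordCountReducer`) are illustrative rather than part of any standard library; the map task emits one intermediate (word, 1) pair per word, and the reduce task sums the values that share a word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: the input key is the byte offset of a line in the file and the
    // input value is the line itself; for every word in the line, an
    // intermediate (word, 1) pair is emitted.
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);   // intermediate key/value pair
            }
        }
    }

    // Reduce task: all intermediate values sharing the same key (a word) are
    // delivered together; they are summed and a single (word, count) pair is
    // written out.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);     // zero, one, or more output items per key
        }
    }
}
```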
Note that the map and reduce functions do not address the parallelization and execution of MapReduce jobs. This is the responsibility of the MapReduce framework, which automatically takes care of distributing the input data, and of scheduling and managing the map and reduce tasks.
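The division of responsibility is visible in a typical Hadoop driver program, sketched below under the assumption that the `WordCount` classes from the previous example are on the classpath. The developer only wires up the map and reduce functions, the key/value types, and the input and output locations; splitting the input, scheduling tasks across the cluster, and re-executing failed tasks are left to the framework.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Only the map/reduce functions and data locations are declared here;
        // data distribution and task scheduling are handled by the framework.
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```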