
Big Data Dictionary

Hadoop Online

The original implementation of the Hadoop framework was designed so that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage. This materialization step allows for the implementation of a simple and elegant checkpoint/restart fault tolerance mechanism. The Hadoop Online system has been proposed as a modified architecture of the Hadoop framework in which intermediate data is pipelined between operators while preserving the programming interfaces and fault tolerance models of previous MapReduce frameworks. This pipelining approach provides important advantages to the MapReduce framework such as:

  • The reducers can begin processing the data as soon as it is produced by mappers. Therefore, they can generate and refine an approximation of their final answer during the course of execution. In addition, they can provide initial estimates of the results several orders of magnitude faster than the final results.
  • It widens the domain of problems to which MapReduce can be applied. For example, it facilitates the ability to design MapReduce jobs that run continuously, accepting new data as it arrives and analyzing it immediately (continuous queries). This allows MapReduce to be used for applications such as event monitoring and stream processing.
  • Pipelining delivers data to downstream operators more promptly, which can increase opportunities for parallelism, improve utilization, and reduce response time.
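The first advantage above can be illustrated with a small simulation. The sketch below is not HOP's actual implementation (which pipelines over TCP sockets between Java tasks); it simply models a mapper streaming records to a reducer through a queue, with the reducer taking periodic snapshots of its running aggregate so an approximate answer is available long before the input is exhausted. The snapshot interval and the count-style aggregate are illustrative choices.

```python
from queue import Queue
from threading import Thread

def mapper(records, out_queue):
    # Pipelined mapper: emit each (key, value) pair as soon as it is
    # produced, instead of materializing the full output to a local file.
    for _ in records:
        out_queue.put(("count", 1))
    out_queue.put(None)  # end-of-stream marker

def reducer(in_queue, snapshots):
    # Pipelined reducer: maintain a running aggregate and record periodic
    # snapshots, so an approximate answer exists before all input arrives.
    total = 0
    seen = 0
    while True:
        item = in_queue.get()
        if item is None:
            break
        total += item[1]
        seen += 1
        if seen % 2 == 0:           # snapshot interval (illustrative)
            snapshots.append(total)
    snapshots.append(total)          # final, exact answer

q = Queue()
snapshots = []
t = Thread(target=mapper, args=(range(5), q))
t.start()
reducer(q, snapshots)
t.join()
print(snapshots)  # early estimates, then the final count as the last entry
```

In real HOP the reducer additionally applies the reduce function to the data it has received so far to produce these snapshots, and the estimates converge to the exact answer as more map output arrives.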

In this approach, each reduce task contacts every map task upon initiation of the job and opens a TCP socket which will be used to pipeline the output of the map function. As each map output record is produced, the mapper determines which partition (reduce task) the record should be sent to, and immediately sends it via the appropriate socket. A reduce task accepts the pipelined data it receives from each map task and stores it in an in-memory buffer. Once the reduce task learns that every map task has completed, it performs a final merge of all the sorted runs. In addition, the reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job, sidestepping the need for expensive fault-tolerant storage in HDFS for what amounts to a temporary file. However, the computation of the reduce function from the previous job and the map function of the next job cannot be overlapped, as the final result of the reduce step cannot be produced until all map tasks have completed, which prevents effective pipelining. Therefore, the reducer treats the output of a pipelined map task as tentative until the JobTracker informs the reducer that the map task has committed successfully. The reducer can merge together spill files generated by the same uncommitted mapper, but will not combine those spill files with the output of other map tasks until it has been notified that the map task has committed. Thus, if a map task fails, each reduce task can ignore any tentative spill files produced by the failed map attempt. The JobTracker will take care of scheduling a new map task attempt, as in standard Hadoop. In principle, the main limitation of the Hadoop Online approach is that it is based on HDFS. Therefore, it is not suitable for streaming applications, in which data streams have to be processed without any disk involvement.
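The tentative-spill bookkeeping described above can be sketched as follows. This is an illustrative model, not HOP's actual Java classes: spill runs from an uncommitted mapper are kept segregated per map task, promoted to the final merge set only when the JobTracker reports a successful commit, and discarded wholesale if the attempt fails. The class and method names are hypothetical.

```python
import heapq

class ReduceBuffer:
    """Illustrative reducer-side buffer for pipelined, tentative map output."""

    def __init__(self):
        self.tentative = {}   # map_task_id -> list of sorted spill runs
        self.committed = []   # runs that are safe to merge into the result

    def receive_spill(self, map_task_id, run):
        # Runs from the same uncommitted mapper may be merged with each other,
        # but are kept separate from other map tasks' output.
        self.tentative.setdefault(map_task_id, []).append(sorted(run))

    def on_commit(self, map_task_id):
        # JobTracker reported a successful commit: promote this mapper's runs.
        self.committed.extend(self.tentative.pop(map_task_id, []))

    def on_failure(self, map_task_id):
        # Failed attempt: drop its tentative spills; the JobTracker schedules
        # a new attempt, which will re-pipeline its output.
        self.tentative.pop(map_task_id, None)

    def final_merge(self):
        # Final merge of all committed sorted runs, performed once the
        # reduce task learns that every map task has completed.
        return list(heapq.merge(*self.committed))

buf = ReduceBuffer()
buf.receive_spill("map-1", [3, 1])   # tentative output from map-1
buf.receive_spill("map-1", [2])      # second spill from the same mapper
buf.receive_spill("map-2", [9])      # tentative output from map-2
buf.on_failure("map-2")              # map-2 failed: its spills are ignored
buf.on_commit("map-1")               # map-1 committed: runs become final
print(buf.final_merge())
```

The key property the sketch captures is that a failed map attempt leaves no trace in the final merge, which is what lets HOP keep standard Hadoop's fault tolerance model despite pipelining.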
