Big Data Dictionary

HaLoop

Many data analysis techniques (e.g. the PageRank algorithm, recursive relational queries, social network analysis) require iterative computations. These techniques share a common requirement: data are processed iteratively until the computation satisfies a convergence or stopping condition. The basic MapReduce framework does not directly support such iterative data analysis applications. Instead, programmers must implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution using a driver program.
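The driver-program pattern described above can be sketched as follows. This is a minimal single-process illustration in Python, not Hadoop code: run_mapreduce, EDGES, and the label-propagation task are hypothetical stand-ins chosen only to show a driver that resubmits one job per iteration and checks the stopping condition itself.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for one MapReduce job (assumption, not Hadoop)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical loop-invariant input: an undirected graph as adjacency lists.
EDGES = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}

def mapper(node, label):
    yield node, label                  # keep the node's own label
    for neighbour in EDGES[node]:      # invariant data is consulted again in every job
        yield neighbour, label

def reducer(node, labels):
    return min(labels)                 # adopt the smallest label seen

def driver(max_iterations=20):
    """Issue one job per iteration until the labels stop changing."""
    labels = {node: node for node in EDGES}
    for iteration in range(1, max_iterations + 1):
        new_labels = run_mapreduce(labels.items(), mapper, reducer)
        if new_labels == labels:       # stopping condition checked by the driver
            return labels, iteration
        labels = new_labels
    return labels, max_iterations

labels, iterations = driver()
```

In this sketch every iteration is a separate job that knows nothing about the previous one, so the unchanged graph is read again each time and the convergence test lives outside the framework.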

The HaLoop system is designed to support iterative processing on the MapReduce framework by extending the basic MapReduce framework with two main functionalities:

Caching the invariant data in the first iteration and then reusing them in later iterations.
Caching the reducer outputs, which makes checking for a fixpoint more efficient, without an extra MapReduce job.
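A rough sketch of these two caches, in Python, may help. IterationCache, load_edges, and step are hypothetical names invented for this illustration and do not correspond to HaLoop's actual classes. The sketch only assumes that invariant data is loaded once and that the previous reducer output is kept so the fixpoint test is a local comparison.

```python
class IterationCache:
    """Hypothetical per-node cache; an illustration, not HaLoop's real API."""

    def __init__(self):
        self.invariant = None         # loop-invariant input, loaded once
        self.previous_output = None   # reducer output of the previous iteration
        self.invariant_loads = 0      # counts how often the loader actually ran

    def load_invariant(self, loader):
        # Only the first iteration pays the cost of loading the invariant data.
        if self.invariant is None:
            self.invariant = loader()
            self.invariant_loads += 1
        return self.invariant

    def reached_fixpoint(self, output):
        # Compare against the cached previous output instead of running an extra job.
        done = output == self.previous_output
        self.previous_output = dict(output)
        return done

def load_edges():
    # Stand-in for an expensive read of data that never changes across iterations.
    return {1: [2], 2: [1, 3], 3: [2]}

def step(labels, edges):
    # One loop-body iteration: each node takes the smallest label among itself and its neighbours.
    return {n: min([labels[n]] + [labels[m] for m in edges[n]]) for n in edges}

cache = IterationCache()
labels = {1: 1, 2: 2, 3: 3}
iterations = 0
while True:
    edges = cache.load_invariant(load_edges)
    labels = step(labels, edges)
    iterations += 1
    if cache.reached_fixpoint(labels):
        break
```

Here the loader runs once no matter how many iterations follow, and the termination test needs nothing beyond the output already held in the cache.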

The above figure illustrates the architecture of HaLoop as a modified version of the basic MapReduce framework. In order to accommodate the requirements of iterative data analysis applications, HaLoop has incorporated the following changes to the basic Hadoop MapReduce framework:

It exposes a new application programming interface to users that simplifies the expression of iterative MapReduce programs. HaLoop's master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body until a user-specified stopping condition is met.
It uses a new task scheduler that leverages data locality.
It caches and indexes application data on slave nodes. In principle, the task tracker not only manages task execution but also manages caches and indices on the slave node and redirects each task's cache and index accesses to the local file system.
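The loop control module and the locality-aware scheduler listed above can be sketched together in Python. All names here (LoopAwareScheduler, run_loop, the partition and node labels) are hypothetical, and the placement policy is an assumption made for illustration: each partition is sent back to the node that handled it before, since that is where its cached data would reside.

```python
class LoopAwareScheduler:
    """Hypothetical scheduler: keeps each partition on the node that first processed it."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.placement = {}  # partition -> node assumed to hold its cache

    def assign(self, partition):
        if partition not in self.placement:
            # First sighting: spread partitions over the nodes round-robin.
            node = self.nodes[len(self.placement) % len(self.nodes)]
            self.placement[partition] = node
        return self.placement[partition]

def run_loop(loop_body, state, should_stop, scheduler, partitions, max_iterations=50):
    """Hypothetical loop control: rerun the loop body until should_stop holds."""
    history = []  # one partition -> node mapping per iteration
    for _ in range(max_iterations):
        history.append({p: scheduler.assign(p) for p in partitions})
        previous = state
        for step in loop_body:   # the steps that compose the loop body
            state = step(state)
        if should_stop(previous, state):  # user-specified stopping condition
            break
    return state, history

# Toy loop body with two steps; x -> x / 2 + 1 converges towards 2.
loop_body = [lambda x: x / 2, lambda x: x + 1]

def should_stop(previous, current):
    return abs(current - previous) < 0.01

scheduler = LoopAwareScheduler(["node-a", "node-b"])
state, history = run_loop(loop_body, 0.0, should_stop, scheduler, ["p0", "p1", "p2"])
```

The user supplies only the loop body and the stopping condition, while the placement recorded in history stays the same from one iteration to the next.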

In principle, HaLoop relies on the same file system and has the same task queue structure as Hadoop, but the task scheduler and task tracker modules are modified, and the loop control, caching, and indexing modules are newly introduced to the architecture.