Big Data Dictionary

Pregel

In general, graph algorithms can be written as a series of chained Hadoop invocations that requires passing the entire state of the graph from one stage to the next. However, this approach is ill-suited for graph processing and can lead to suboptimal performance due to the additional communication and associated serialization overhead in addition to the need of coordinating the steps of a chained Hadoop. The Pregel system has been introduced by Google as scalable platform for implementing graph algorithms. It relies on a vertex-centric approach, which is inspired by the Bulk Synchronous Parallel model (BSP), where programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices and modify its own state as well as that of its outgoing edges or mutate graph topology. In particular, Pregel computations consist of a sequence of iterations, called supersteps. During a superstep the framework invokes a user-dened function for each vertex, conceptually in parallel, which specifies the behavior at a single vertex V and a single superstep S.

The above figure illustrates the Bulk Synchronous Parallel model (BSP) Programming Model where each superstep S can read messages sent to V in superstep S - 1, send messages to other vertices that will be received at superstep S + 1, and modify the state of V and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known.

The above figure illustrates an example of nding the maximum value in a strongly connected graph using the BSP model. In this example, each superstep sets a new maximum value from incoming messages, send new value maximum to edges, otherwise vote to halt. Similar to the Hadoop framework, Pregel has been designed to be efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers where the distribution-related details are hidden behind an abstract. It keeps vertices and edges on the machine that performs computation and uses network transfers only for messages. Hence, the model is well suited for distributed implementations as it doesn't expose any mechanism for detecting order of execution within a superstep, and all communication is from superstep S to superstep S + 1.

The ideas of Pregel have been cloned by many open source projects such as GoldenOrb, Signal/Collect, Hama and Giraph. Both of Hama and Giraph are implemented to be launched as a typical Hadoop job that can leverage the Hadoop infrastructure.