Big Data Dictionary

Dryad/DryadLINQ

Dryad is a general-purpose distributed execution engine introduced by Microsoft for coarse-grained data-parallel applications. A Dryad application combines computational vertices with communication channels to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes and shared-memory FIFOs. The Dryad system gives the developer fine control over the communication graph as well as the subroutines that live at its vertices. A Dryad application developer can specify an arbitrary directed acyclic graph to describe the application's communication patterns and express the data transport mechanisms (files, TCP pipes and shared-memory FIFOs) between the computation vertices. This direct specification of the graph gives the developer greater flexibility to compose basic common operations, leading to a distributed analogue of piping together traditional Unix utilities such as grep, sort and head.
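The Unix-pipe analogy can be illustrated with a minimal single-machine sketch in Python. This is not Dryad's API: the names grep, sort, head and compose are hypothetical helpers introduced here, each stage plays the role of a vertex, and the iterator passed from one stage to the next plays the role of a channel.

```python
import itertools
from typing import Callable, Iterable, Iterator

# A "vertex" here is any function that consumes a stream of items
# and produces a new stream (hypothetical type, for illustration only).
Vertex = Callable[[Iterable[str]], Iterator[str]]

def grep(pattern: str) -> Vertex:
    """Keep only the items that contain `pattern`."""
    def run(items: Iterable[str]) -> Iterator[str]:
        return (s for s in items if pattern in s)
    return run

def sort() -> Vertex:
    """Emit the items in sorted order (must read the whole input first)."""
    def run(items: Iterable[str]) -> Iterator[str]:
        return iter(sorted(items))
    return run

def head(n: int) -> Vertex:
    """Emit only the first `n` items."""
    def run(items: Iterable[str]) -> Iterator[str]:
        return itertools.islice(items, n)
    return run

def compose(*stages: Vertex) -> Vertex:
    """Chain stages so the output of each feeds the next, like `a | b | c`."""
    def run(items: Iterable[str]) -> Iterator[str]:
        for stage in stages:
            items = stage(items)
        return iter(items)
    return run

# Analogue of: grep error | sort | head -n 2
pipeline = compose(grep("error"), sort(), head(2))
lines = ["error: disk", "ok", "error: cpu", "error: net"]
result = list(pipeline(lines))
print(result)
```

A linear pipeline is only the simplest case. Dryad generalizes it to arbitrary directed acyclic graphs whose vertices run on different machines.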

Dryad is notable for allowing graph vertices (and computations in general) to use an arbitrary number of inputs and outputs, whereas MapReduce restricts all computations to take a single input set and generate a single output set. The overall structure of a Dryad job is determined by its communication flow. A job is a directed acyclic graph where each vertex is a program and edges represent data channels. It is a logical computation graph that is automatically mapped onto physical resources by the runtime. At run time each channel is used to transport a finite sequence of structured items. A Dryad job is coordinated by a process called the job manager that runs either within the cluster or on a user's workstation with network access to the cluster. The job manager contains the application-specific code to construct the job's communication graph along with library code to schedule the work across the available resources. All data is sent directly between vertices, so the job manager is only responsible for control decisions and is not a bottleneck for any data transfers. Much of the simplicity of the Dryad scheduler and fault-tolerance model comes from the assumption that vertices are deterministic: a vertex that fails can simply be re-executed on the same inputs.
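As a rough illustration of these ideas, the following Python sketch models a job as a DAG whose vertices may take several inputs, and a toy job manager that runs each vertex once all of its inputs are available. The Job class, run_job function and max_attempts parameter are hypothetical names chosen for this example, not part of Dryad. The sketch also simplifies in one important way: it keeps vertex outputs in a dictionary inside the manager, whereas in Dryad data flows directly between vertices. Retrying a failed vertex is safe here only because vertices are assumed to be deterministic.

```python
from collections import deque

class Job:
    """A logical job graph: each vertex is a program, each edge a data channel."""
    def __init__(self):
        self.programs = {}  # vertex name -> callable
        self.inputs = {}    # vertex name -> ordered list of upstream vertex names

    def add_vertex(self, name, program, inputs=()):
        self.programs[name] = program
        self.inputs[name] = list(inputs)

def run_job(job, max_attempts=2):
    """Toy job manager: schedule vertices in dependency order, retry on failure."""
    pending = {v: len(ins) for v, ins in job.inputs.items()}
    consumers = {v: [] for v in job.inputs}
    for v, ins in job.inputs.items():
        for upstream in ins:
            consumers[upstream].append(v)

    ready = deque(v for v, n in pending.items() if n == 0)
    outputs = {}
    while ready:
        v = ready.popleft()
        args = [outputs[u] for u in job.inputs[v]]
        for attempt in range(max_attempts):
            try:
                # Deterministic vertices give the same result on every attempt,
                # so re-execution after a failure is safe.
                outputs[v] = job.programs[v](*args)
                break
            except RuntimeError:
                if attempt == max_attempts - 1:
                    raise
        for c in consumers[v]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)

    if len(outputs) != len(job.programs):
        raise ValueError("job graph contains a cycle")
    return outputs

# A source that fails on its first attempt, simulating a machine failure.
attempts = {"count": 0}
def flaky_source():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated machine failure")
    return [10, 20, 30]

job = Job()
job.add_vertex("left", lambda: [1, 2, 3])
job.add_vertex("right", flaky_source)
# A two-input vertex, which a single MapReduce stage cannot express directly.
job.add_vertex("join", lambda a, b: [x + y for x, y in zip(a, b)],
               inputs=["left", "right"])
job.add_vertex("total", lambda xs: sum(xs), inputs=["join"])

outputs = run_job(job)
print(outputs["join"], outputs["total"])
```

The "join" vertex reads from two upstream vertices, and the "right" vertex succeeds only on its second attempt, which the manager handles without involving the rest of the graph.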

Dryad has its own high-level language called DryadLINQ. It generalizes execution environments such as SQL and MapReduce in two ways:

Adopting an expressive data model of strongly typed .NET objects.
Supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.

DryadLINQ exploits LINQ (Language INtegrated Query, a set of .NET constructs for programming with datasets) to provide a powerful hybrid of declarative and imperative programming. The system is designed to provide flexible and efficient distributed computation in any LINQ-enabled programming language including C#, VB and F#. Objects in DryadLINQ datasets can be of any .NET type, making it easy to compute with data such as image patches, vectors and matrices. In practice, a DryadLINQ program is a sequential program composed of LINQ expressions that perform arbitrary side-effect-free transformations on datasets and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically translates the data-parallel portions of the program into a distributed execution plan, which is then passed to the Dryad execution platform. A commercial implementation of Dryad and DryadLINQ was released in 2011 under the name LINQ to HPC.
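The idea of writing a sequential-looking query that is first recorded as a plan and only later executed over partitioned data can be sketched in Python. This is an analogy only: real DryadLINQ programs are written with LINQ operators in a .NET language, and the Query class below, with its where, select, describe and collect methods, is a hypothetical stand-in that runs the partitions one after another in a single process.

```python
class Query:
    """A lazily evaluated query over a partitioned dataset (illustrative only)."""
    def __init__(self, partitions, plan=()):
        self.partitions = partitions  # list of lists, one per partition
        self.plan = tuple(plan)       # recorded operations, not yet executed

    def where(self, predicate):
        # Record a filter step; nothing is computed yet.
        return Query(self.partitions, self.plan + (("where", predicate),))

    def select(self, transform):
        # Record a per-item transformation step.
        return Query(self.partitions, self.plan + (("select", transform),))

    def describe(self):
        """Return the names of the planned steps, in order."""
        return [op for op, _ in self.plan]

    def _run_partition(self, partition):
        # Apply every planned step to one partition. The functions are assumed
        # to be side-effect free, so partitions can be processed independently.
        items = list(partition)
        for op, fn in self.plan:
            if op == "where":
                items = [x for x in items if fn(x)]
            else:
                items = [fn(x) for x in items]
        return items

    def collect(self):
        """Execute the plan on each partition and concatenate the results."""
        results = []
        for partition in self.partitions:
            results.extend(self._run_partition(partition))
        return results

partitions = [[1, 2, 3], [4, 5, 6]]
query = Query(partitions).where(lambda x: x % 2 == 0).select(lambda x: x * x)

print(query.describe())
print(query.collect())
```

Because the steps are recorded rather than run immediately, a system in the style of DryadLINQ is free to inspect the whole plan and decide how to distribute it before any data is touched.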