
Big Data Dictionary

Spark

The Spark system was proposed to support applications that need to reuse a working set of data across multiple parallel operations (e.g., iterative machine learning algorithms and interactive data analytics) while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs do not need to be materialized at all times; instead, they achieve fault tolerance through a notion of lineage: each RDD object contains a pointer to its parent and information about how the parent was transformed. Hence, if a partition of an RDD is lost, the RDD has sufficient information about how it was derived from other RDDs to rebuild just that partition.
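As a rough illustration of this working-set reuse, the sketch below builds an RDD from a log file, caches it in memory, and runs two parallel operations over the cached data. It uses the public RDD API (SparkContext, textFile, filter, cache, count); the master URL and file path are placeholders chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCacheExample {
  def main(args: Array[String]): Unit = {
    // The master URL and file path below are illustrative placeholders.
    val conf = new SparkConf().setAppName("rdd-cache-example").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Build an RDD from a text file; nothing is materialized yet.
    val errors = sc.textFile("hdfs://namenode/logs/app.log")
                   .filter(line => line.contains("ERROR"))

    // Explicitly cache the working set in memory so that the following
    // parallel operations reuse it instead of re-reading the input file.
    errors.cache()

    // First parallel operation: count the cached elements.
    val total = errors.count()

    // Second parallel operation on the same cached RDD.
    val timeouts = errors.filter(_.contains("timeout")).count()

    println(s"errors=$total, timeouts=$timeouts")
    sc.stop()
  }
}
```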

Spark is implemented in the Scala programming language. It is built on top of Mesos, a cluster operating system that lets multiple parallel frameworks share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster. Mesos provides isolation and efficient resource sharing across frameworks running on the same cluster while giving each framework the freedom to implement its own programming model and fully control the execution of its jobs. Mesos uses two main abstractions: tasks and slots. A task represents a unit of work, while a slot represents a computing resource in which a framework may run a task, such as a core and some associated memory on a multicore machine. Mesos employs a two-level scheduling mechanism: at the first level, Mesos allocates slots between frameworks using fair sharing; at the second level, each framework is responsible for dividing its work into tasks and selecting which tasks to run in each slot. This lets frameworks perform application-specific optimizations. For example, Spark's scheduler tries to send each task to one of its preferred locations using a technique called delay scheduling.
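The following is a deliberately simplified, self-contained sketch of the idea behind delay scheduling, not Spark's or Mesos's actual scheduler code: when a slot frees up on a node that is not among a task's preferred locations, the scheduler keeps waiting for a local slot, and only falls back to a non-local slot after the task has waited past a threshold. All names and thresholds here are illustrative.

```scala
// Simplified illustration of delay scheduling (assumed names and values).
object DelaySchedulingSketch {
  case class Task(id: Int, preferredNodes: Set[String], submittedAt: Long)

  def chooseTask(freeSlotNode: String, queue: Seq[Task],
                 now: Long, maxDelayMs: Long): Option[Task] = {
    // Prefer a task whose preferred locations include this slot's node.
    queue.find(_.preferredNodes.contains(freeSlotNode))
      // Otherwise run a task that has already waited past the delay threshold,
      // giving up data locality for that task.
      .orElse(queue.find(t => now - t.submittedAt > maxDelayMs))
  }

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    val queue = Seq(
      Task(1, Set("node-a"), now - 10000),
      Task(2, Set("node-b"), now - 100)
    )
    // A slot frees up on node-b: task 2 is picked for data locality.
    println(chooseTask("node-b", queue, now, maxDelayMs = 5000))
    // A slot frees up on node-c: task 1 has waited long enough to run non-locally.
    println(chooseTask("node-c", queue, now, maxDelayMs = 5000))
  }
}
```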

To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel. Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset). Different parallel operations can be performed on RDDs (a short sketch follows the list below):

  • The reduce operation, which combines dataset elements using an associative function to produce a result at the driver program.
  • The collect operation, which sends all elements of the dataset to the driver program.
  • The foreach operation, which passes each element through a user-provided function.
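
A minimal sketch of these three operations, assuming a local SparkContext and a small synthetic dataset built with parallelize:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelOpsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("parallel-ops").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 1000)

    // reduce: combine elements with an associative function; the result
    // is returned to the driver program.
    val sum = numbers.reduce(_ + _)

    // collect: send all elements of the (filtered) dataset back to the driver.
    val multiples: Array[Int] = numbers.filter(_ % 100 == 0).collect()

    // foreach: apply a user-provided function to each element on the workers.
    numbers.foreach(n => if (n % 250 == 0) println(s"checkpoint: $n"))

    println(s"sum=$sum, multiplesOf100=${multiples.mkString(",")}")
    sc.stop()
  }
}
```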

Spark does not currently support a grouped reduce operation as in MapReduce; the results of reduce operations are only collected at the driver process. In addition, Spark supports two restricted types of shared variables that address two simple but common usage patterns (illustrated in the sketch after this list):

  • Broadcast variables: objects that wrap a value and ensure that it is copied to each worker only once.
  • Accumulators: These are variables that workers can only add to using an associative operation, and that only the driver can read.
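
A minimal sketch of both kinds of shared variables, assuming a local SparkContext. The lookup table is illustrative, and the longAccumulator call reflects the accumulator API of more recent Spark releases:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

    // Broadcast variable: the lookup table is shipped to each worker once
    // and then reused by every task running on that worker.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))

    // Accumulator: workers may only add to it; only the driver reads the value.
    val missing = sc.longAccumulator("missing-keys")

    val codes  = sc.parallelize(Seq("a", "b", "x", "c", "y"))
    val values = codes.map { k =>
      // Count keys that are absent from the broadcast table.
      lookup.value.getOrElse(k, { missing.add(1); 0 })
    }.collect()

    println(s"values=${values.mkString(",")}, missing=${missing.value}")
    sc.stop()
  }
}
```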

It should be noted that Spark can also be used interactively from a modified version of the Scala interpreter, which allows the user to define RDDs, functions, variables, and classes and use them in parallel operations on a cluster.
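
A hypothetical interactive session might look like the following, assuming the shell pre-binds a SparkContext to the name sc and that the file path exists:

```scala
// Typed at the interactive Spark shell prompt, where `sc` is pre-defined
// (the log path is a placeholder).
val logs = sc.textFile("hdfs://namenode/logs/app.log").cache()

// Functions and variables defined at the prompt can be used directly
// in parallel operations on the cluster.
def isError(line: String): Boolean = line.contains("ERROR")

logs.filter(isError).count()
```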
