Big Data Dictionary


Stratosphere is a system for parallel data analysis that comprises a programming model called PACT and an execution engine called Nephele. The PACT programming model is a generalization of map/reduce: it is based on a key/value data model and the concept of Parallelization Contracts (PACTs). A PACT consists of exactly one second-order function, called the Input Contract, and an optional Output Contract. An Input Contract takes a first-order function with task-specific user code and one or more data sets as input parameters. The Input Contract invokes its associated first-order function on independent subsets of its input data in a data-parallel fashion. In this context, map and reduce are just two examples of Input Contracts. Other examples of Input Contracts include:

  • The Cross contract, which operates on multiple inputs and builds a distributed Cartesian product over its input sets.
  • The CoGroup contract, which partitions each of its multiple inputs along the key. Independent subsets are built by combining equal keys of all inputs.
  • The Match contract, which operates on multiple inputs. It matches key/value pairs from all input data sets that have the same key (equivalent to the inner join operation).
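The Input Contracts described above can be sketched as ordinary second-order functions. The following is a minimal, single-process illustration in Python, not Stratosphere's actual API: the function names (`map_contract`, `match_contract`, and so on), the representation of a data set as a list of `(key, value)` tuples, and the sample data are all assumptions made for this example. Each contract only decides which independent subsets of the input are handed to the user function `udf`; a real engine would process those subsets in parallel.

```python
from collections import defaultdict
from itertools import product

# Illustrative only: a data set is a list of (key, value) pairs, and every
# user function (udf) returns a list of output records.

def _group_by_key(data):
    """Collect the values of each key, preserving first-seen key order."""
    groups = defaultdict(list)
    for key, value in data:
        groups[key].append(value)
    return groups

def map_contract(udf, data):
    # Each key/value pair is an independent subset.
    return [out for key, value in data for out in udf(key, value)]

def reduce_contract(udf, data):
    # All values that share a key form one independent subset.
    return [out for key, values in _group_by_key(data).items()
            for out in udf(key, values)]

def cross_contract(udf, left, right):
    # Every element of the Cartesian product is an independent subset.
    return [out for l, r in product(left, right) for out in udf(l, r)]

def cogroup_contract(udf, left, right):
    # One subset per key, holding that key's values from both inputs.
    lgroups, rgroups = _group_by_key(left), _group_by_key(right)
    keys = sorted(set(lgroups) | set(rgroups))
    return [out for key in keys
            for out in udf(key, lgroups.get(key, []), rgroups.get(key, []))]

def match_contract(udf, left, right):
    # One subset per pair of records with equal keys (an inner join).
    rgroups = _group_by_key(right)
    return [out for key, lvalue in left
            for rvalue in rgroups.get(key, [])
            for out in udf(key, lvalue, rvalue)]

# Hypothetical sample data.
orders = [(1, "book"), (2, "pen"), (1, "lamp")]
customers = [(1, "Ada"), (3, "Bob")]

matched = match_contract(lambda k, o, c: [(k, (c, o))], orders, customers)
cogrouped = cogroup_contract(lambda k, os, cs: [(k, (len(os), len(cs)))],
                             orders, customers)
crossed = cross_contract(lambda l, r: [(l, r)], orders, customers)
```

In this sketch Match only calls the user function for key 1, the single key present in both inputs, whereas CoGroup calls it once for each of the keys 1, 2 and 3, passing an empty list for the side that has no records.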

An Output Contract is an optional component of a PACT and gives guarantees about the data that is generated by the assigned user function. The set of Output Contracts includes:

  • The Same-Key contract, where each key/value pair that is generated by the function has the same key as the key/value pair(s) from which it was generated. This means the function will preserve any partitioning and order property on the keys.
  • The Super-Key contract, where each key/value pair that is generated by the function has a superkey of the key/value pair(s) from which it was generated. This means the function will preserve a partitioning and partial order on the keys.
  • The Unique-Key contract, where each key/value pair that is produced has a unique key. The key must be unique across all parallel instances. Any produced data is therefore partitioned and grouped by the key.
  • The Partitioned-by-Key contract, where key/value pairs are partitioned by key. This contract has implications similar to those of the Super-Key contract, specifically that a partitioning by the keys is given, but there is no order inside the partitions.
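To see why a guarantee such as Same-Key is useful to an optimizer, consider the following small Python sketch. It is an illustration under simplifying assumptions, not part of Stratosphere: the helper names (`hash_partition`, `is_partitioned_by_key`, `apply_per_partition`) and the sample data are made up for this example. A user function that keeps the key leaves an existing partitioning intact, so no re-partitioning is needed afterwards, while a function that changes the key may spread equal keys over several partitions.

```python
def hash_partition(data, num_partitions):
    """Assign each (key, value) pair to a partition by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in data:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def is_partitioned_by_key(parts):
    """Return True if no key occurs in more than one partition."""
    owner = {}
    for index, part in enumerate(parts):
        for key, _ in part:
            if owner.setdefault(key, index) != index:
                return False
    return True

def apply_per_partition(udf, parts):
    """Run the user function on every record without moving any data."""
    return [[udf(key, value) for key, value in part] for part in parts]

# Hypothetical sample data with small integer keys.
data = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
parts = hash_partition(data, 2)

# This function satisfies Same-Key: the output key equals the input key.
same_key = apply_per_partition(lambda k, v: (k, v.upper()), parts)

# This function does not: it derives a new key from the old one.
new_key = apply_per_partition(lambda k, v: (k // 2, v), parts)

print(is_partitioned_by_key(same_key))  # True
print(is_partitioned_by_key(new_key))   # False
```

Without an Output Contract, a compiler cannot tell these two functions apart by inspecting only their signatures and would have to assume the second, pessimistic case.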

The above figure illustrates the system architecture of Nephele/PACT: a PACT program is submitted to the PACT compiler, which translates the program into a data flow execution plan that is then handed to the Nephele system for parallel execution. The Hadoop Distributed File System (HDFS) is used for storing both the input and the output data. Due to the declarative character of the PACT programming model, the PACT compiler can apply different optimization mechanisms and select from several execution plans with varying costs for a single PACT program. For example, the Match contract can be satisfied using either a repartition strategy, which partitions all inputs by key, or a broadcast strategy, which fully replicates one input to every partition of the other input. Choosing the right strategy can dramatically reduce network traffic and execution time. Therefore, the PACT compiler applies standard SQL optimization techniques: it exploits the information provided by the Output Contracts and applies different cost-based optimization techniques. In particular, the optimizer generates a set of candidate execution plans in a bottom-up fashion (starting from the data sources), where the more expensive plans are pruned using a set of interesting properties for the operators. These properties are also used to spare from pruning those plans that come with an additional property that may amortize their cost overhead later.
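The choice between the repartition and broadcast strategies for Match can be illustrated with a deliberately simplified cost model. The Python sketch below is an assumption-laden illustration, not the PACT compiler's actual cost function: it counts only records shipped over the network, assumes that repartitioning moves every record of both inputs once, and assumes that broadcasting sends the smaller input to every parallel instance. All function and parameter names are hypothetical.

```python
def repartition_cost(left_size, right_size):
    # Assumption: every record of both inputs crosses the network once.
    return left_size + right_size

def broadcast_cost(left_size, right_size, parallelism):
    # Assumption: the smaller input is copied to every parallel instance,
    # while the larger input stays where it is.
    return min(left_size, right_size) * parallelism

def choose_match_strategy(left_size, right_size, parallelism):
    """Return the name of the strategy with the lower estimated cost."""
    costs = {
        "repartition": repartition_cost(left_size, right_size),
        "broadcast": broadcast_cost(left_size, right_size, parallelism),
    }
    # On a tie, min() returns the first entry, i.e. "repartition".
    return min(costs, key=costs.get)

# A large input joined with a tiny one: replicating the tiny input is cheap.
print(choose_match_strategy(1_000_000, 1_000, parallelism=8))      # broadcast

# Two inputs of equal size: replicating either one is too expensive.
print(choose_match_strategy(1_000_000, 1_000_000, parallelism=8))  # repartition
```

A real optimizer would also account for properties that already hold on the inputs. For instance, if an Output Contract guarantees that one input is already partitioned by the join key, that input does not need to be shipped at all under the repartition strategy.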

Stratosphere provides Sopremo, a semantically rich operator model, and Meteor, an extensible query language that is grounded in Sopremo. Sopremo provides a programming framework that allows users to define custom packages, the respective operators and their instantiations. Meteor's syntax is operator-oriented and uses a JSON-like data model to support applications that analyze semi-structured and unstructured data. Meteor queries are translated into data flow programs of operator instantiations, which represent concrete implementations of the involved Sopremo operators. A main advantage of this approach is that the operators' semantics can be accessed at compile time and potentially be used for data flow optimization, or for detecting queries that are syntactically correct but semantically erroneous.
