Stratosphere is a system for parallel data analysis that comprises a programming model called PACT and an execution engine called Nephele. The PACT programming model generalizes map/reduce: it is based on a key/value data model and on the concept of Parallelization Contracts (PACTs). A PACT consists of exactly one second-order function, called the Input Contract, and an optional Output Contract. An Input Contract takes a first-order function with task-specific user code and one or more data sets as input parameters. It invokes its associated first-order function on independent subsets of its input data in a data-parallel fashion. In this context, map and reduce are just two examples of Input Contracts. Other examples of Input Contracts include:
An Output Contract is an optional component of a PACT that gives guarantees about the data generated by the assigned user function. The set of Output Contracts includes:
The above figure illustrates the system architecture of Nephele/PACT: a PACT program is submitted to the PACT compiler, which translates it into a data flow execution plan that is then handed to the Nephele system for parallel execution. The Hadoop Distributed File System (HDFS) is used for storing both the input and the output data. Due to the declarative character of the PACT programming model, the PACT compiler can apply different optimization mechanisms and select from several execution plans with varying costs for a single PACT program. For example, the Match contract can be satisfied using either a repartition strategy, which partitions all inputs by key, or a broadcast strategy, which fully replicates one input to every partition of the other input. Choosing the right strategy can dramatically reduce network traffic and execution time. Therefore, the PACT compiler applies standard SQL optimization techniques, exploiting the information provided by the Output Contracts, and applies different cost-based optimization techniques. In particular, the optimizer generates a set of candidate execution plans in a bottom-up fashion (starting from the data sources), where the more expensive plans are pruned using a set of interesting properties for the operators. The same properties are also used to spare from pruning those plans that come with an additional property that may amortize their cost overhead later.
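The two ways of satisfying the Match contract can be illustrated with a small single-process sketch. It is not the PACT compiler or the Nephele runtime: the function names (`match_repartition`, `match_broadcast`, `choose_strategy`), the list-based "partitions", and the record-count cost estimate are all assumptions made for illustration. The sketch only shows that Match acts as a second-order function calling a first-order user function on pairs of records with equal keys, and that both strategies yield the same result at different shipping costs.

```python
from collections import defaultdict


def _local_match(left_part, right_part, user_fn):
    """Join two local partitions on key and call the first-order user function."""
    index = defaultdict(list)
    for key, value in right_part:
        index[key].append(value)
    out = []
    for key, left_value in left_part:
        for right_value in index.get(key, []):
            out.append(user_fn(key, left_value, right_value))
    return out


def match_repartition(left, right, user_fn, num_partitions=4):
    """Repartition strategy: hash-partition BOTH inputs by key, then join locally."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(_local_match(lp, rp, user_fn))
    return out


def match_broadcast(left, right, user_fn, num_partitions=4):
    """Broadcast strategy: keep `left` partitioned as it is, replicate all of `right`."""
    left, right = list(left), list(right)
    # Arbitrary (round-robin) partitioning: `left` does not need to move by key.
    left_parts = [left[i::num_partitions] for i in range(num_partitions)]
    out = []
    for lp in left_parts:
        out.extend(_local_match(lp, right, user_fn))  # every partition sees all of `right`
    return out


def choose_strategy(n_left, n_right, num_partitions):
    """Hypothetical cost estimate: number of records shipped over the network."""
    repartition_cost = n_left + n_right                     # both inputs are shuffled
    broadcast_cost = min(n_left, n_right) * num_partitions  # smaller input is replicated
    return "broadcast" if broadcast_cost < repartition_cost else "repartition"


if __name__ == "__main__":
    orders = [(1, "a"), (2, "b"), (3, "c")]
    customers = [(1, "x"), (3, "y"), (3, "z")]
    concat = lambda key, lv, rv: (key, lv + rv)
    print(sorted(match_repartition(orders, customers, concat)))
    print(sorted(match_broadcast(orders, customers, concat)))
    print(choose_strategy(1_000_000, 100, 8))
```

Under this simplified estimate, broadcasting wins when one input is much smaller than the other, while repartitioning wins when both inputs are of similar size. A real optimizer would also take into account existing partitioning, such as the guarantees given by Output Contracts.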
Stratosphere provides Sopremo, a semantically rich operator model, and Meteor, an extensible query language that is grounded in Sopremo. Sopremo provides a programming framework that allows users to define custom packages, the respective operators and their instantiations. Meteor's syntax is operator-oriented and uses a JSON-like data model to support applications that analyze semi-structured and unstructured data. Meteor queries are then translated into data flow programs of operator instantiations, which represent concrete implementations of the involved Sopremo operators. A main advantage of this approach is that the operators' semantics can be accessed at compile time and potentially be used for data flow optimization, or for detecting syntactically correct but semantically erroneous queries.
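To make the last point concrete, the following minimal sketch shows how operator semantics that are available before execution can reveal a query that is well-formed but semantically wrong. It does not use the Sopremo or Meteor APIs: the `Operator` class, its `requires`/`produces` field sets, and `check_plan` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator description with declared semantics."""
    name: str
    requires: FrozenSet[str]  # fields the operator reads
    produces: FrozenSet[str]  # fields the operator adds to each record


def check_plan(source_fields: Iterable[str], plan: List[Operator]) -> List[str]:
    """Walk the plan before execution and report fields that are read but never produced."""
    available = set(source_fields)
    errors = []
    for op in plan:
        missing = op.requires - available
        if missing:
            errors.append(f"{op.name}: missing field(s) {sorted(missing)}")
        available |= op.produces
    return errors


if __name__ == "__main__":
    extract = Operator("extract_person", frozenset({"text"}), frozenset({"person"}))
    link = Operator("link_person", frozenset({"person"}), frozenset({"person_id"}))
    print(check_plan(["text"], [extract, link]))  # consistent plan: no errors
    print(check_plan(["text"], [link, extract]))  # same operators, wrong order
```

Both plans are syntactically valid sequences of operators; only the declared semantics show that the second one reads a field before any operator has produced it.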