
Big Data Dictionary

Pig

Pig Latin is a language that takes a middle position between expressing tasks in a high-level declarative querying model in the spirit of SQL and the low-level, procedural programming model of MapReduce. Pig Latin is implemented as part of the Pig project and is used by programmers at Yahoo! for developing data analysis tasks. Writing a Pig Latin program is similar to specifying a query execution plan (e.g. a data flow graph). To experienced programmers, this method is more appealing than encoding their task as an SQL query and then coercing the system to choose the desired plan through optimizer hints. In general, automatic query optimization has its limits, especially with uncataloged data, prevalent user-defined functions and parallel execution, which are all features of the data analysis tasks targeted by the MapReduce framework.

Pig Latin has several other features that are important for casual ad-hoc data analysis tasks. These features include support for a flexible, fully nested data model, extensive support for user-defined functions and the ability to operate over plain input files without any schema information. In particular, Pig Latin has a simple data model consisting of the following four types:

  1. Atom: An atom contains a simple atomic value such as a string or a number, e.g. "alice".
  2. Tuple: A tuple is a sequence of fields, each of which can be any of the data types, e.g. ("alice", "lakers").
  3. Bag: A bag is a collection of tuples with possible duplicates. The schema of the constituent tuples is flexible: not all tuples in a bag need to have the same number and type of fields.
  4. Map: A map is a collection of data items, where each item has an associated key through which it can be looked up. As with bags, the schema of the constituent data items is flexible. However, the keys are required to be data atoms.
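
As a rough illustration of how these four types nest, the sketch below models them with ordinary Python values. This is only an analogy, not how Pig represents data internally; the sample values beyond "alice" and "lakers" and the helper `is_atom` are hypothetical.

```python
# Python stand-ins for Pig Latin's four data types (an analogy only).

atom = "alice"                # Atom: a simple atomic value
tup = ("alice", "lakers")     # Tuple: a sequence of fields

# Bag: a collection of tuples. Duplicates are allowed, and the tuples
# do not need to share the same number or type of fields.
bag = [
    ("alice", "lakers"),
    ("alice", "lakers"),            # duplicate tuple is kept
    ("bob", ("ipod", "apple"), 3),  # different arity, nested tuple
]

# Map: keys must be atoms; values can be any data item, including bags.
profile = {
    "name": "alice",
    "fan_of": [("lakers",), ("ipod",)],  # a bag as a map value
    "age": 20,
}


def is_atom(value):
    """Return True for simple atomic values (strings and numbers)."""
    return isinstance(value, (str, int, float))


# Every map key is an atom, as the data model requires.
assert all(is_atom(key) for key in profile)
```

A Python list is used for the bag because, unlike a set, it keeps duplicate tuples.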

Pig Latin provides operators for many of the traditional data operations (e.g., join, sort, filter, group by, union) as well as the ability for users to develop their own functions to read, process, and write data. MapReduce provides the group by operation directly (this is essentially what the shuffle and reduce phases do), and it provides the order by operation indirectly through the way it implements the grouping. Filter and projection operations can be implemented trivially in the map phase. But other operators, particularly join, are not provided and must instead be written by the user. Pig provides some complex, nontrivial implementations of these standard data operations. For example, because the number of records per key in a dataset is rarely evenly distributed, the data sent to the reducers is often skewed. That is, one reducer may get 10 or more times as much data as the other reducers. Pig has join and order by operators that will handle this case and (in some cases) rebalance the reducers.
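
The way filter, group by and sorted keys fall out of the MapReduce phases can be sketched in a few lines of single-process Python. This is a toy model, not Hadoop or Pig code; the function names, the `visits` sample data and the skew measure are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, keep, key_of):
    """Map: filter records and emit (key, record) pairs."""
    for record in records:
        if keep(record):
            yield key_of(record), record


def shuffle(pairs):
    """Shuffle: collect values by key; keys come out in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, combine):
    """Reduce: collapse each key's list of values into one result."""
    return {key: combine(values) for key, values in groups.items()}


# Hypothetical sample data: (user, url) page visits.
visits = [
    ("carol", "nba.com"),
    ("alice", "lakers.com"),
    ("alice", "nba.com"),
    ("bob", ""),
    ("alice", "espn.com"),
]

# Filter out empty URLs in the map phase, then count visits per user.
pairs = map_phase(visits, keep=lambda r: r[1] != "", key_of=lambda r: r[0])
counts = reduce_phase(shuffle(pairs), combine=len)
print(counts)  # {'alice': 3, 'carol': 1}

# A crude skew measure: largest group size relative to the mean group size.
skew = max(counts.values()) / (sum(counts.values()) / len(counts))
print(skew)  # 1.5
```

In a real cluster each key's group goes to a single reducer, so a large ratio here means one reducer does far more work than the rest. That is the situation the skew-aware join and order by operators are meant to address.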

Traditionally, ad-hoc queries are done in languages such as SQL that make it easy to quickly form a question for the data to answer. For research on raw data, some users prefer Pig Latin. Because Pig can operate in situations where the schema is unknown, incomplete, or inconsistent, and because it can easily manage nested data, researchers who want to work on data before it has been cleaned and loaded into the warehouse often prefer Pig. Researchers who work with large data sets often use scripting languages such as Perl or Python to do their processing. Users with these backgrounds often prefer the data flow paradigm of Pig over the declarative query paradigm of SQL.
