Pig Latin

This section gives an informal description of the syntax and semantics of the Pig Latin programming language. It is not meant to offer a complete reference to the language, but there should be enough here for you to get a good understanding of Pig Latin’s constructs.

Structure

A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command. For example, a GROUP operation is a type of statement:
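-- a sketch, assuming a records relation loaded from the weather dataset used later in this section
grouped_records = GROUP records BY year;

The command to list the files in a Hadoop filesystem is another example of a statement:

ls /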

Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it. The ls command, on the other hand, does not have to be terminated with a semicolon. As a general guideline, statements or commands for interactive use in Grunt do not need the terminating semicolon. This group includes the interactive Hadoop commands, as well as the diagnostic operators like DESCRIBE. It’s never an error to add a terminating semicolon, so if in doubt, it’s simplest to add one.

Statements that have to be terminated with a semicolon can be split across multiple lines for readability:
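-- a two-line LOAD statement (the sample file path is the one used in the weather examples)
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);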

Pig Latin has two forms of comments. Double hyphens are single-line comments. Everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:
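-- My program
DUMP A;  -- What's in A?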

(This is not to be confused with Pig Latin, the language game. English words are translated into Pig Latin by moving the initial consonant sound to the end of the word and adding an “ay” sound: for example, “pig” becomes “ig-pay,” and “Hadoop” becomes “Adoop-hay.”)

(Pig Latin does not have a formal language definition as such, but there is a comprehensive guide to the language linked from the Pig wiki at http://wiki.apache.org/pig/.)

(You sometimes see the terms statement, operation, and command used interchangeably in documentation on Pig Latin: for example, “GROUP command,” “GROUP operation,” “GROUP statement.”)

C-style comments are more flexible since they delimit the beginning and end of the comment block with /* and */ markers. They can span lines or be embedded in a single line:
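/*
 * Description of my program spanning
 * multiple lines (the file paths below are illustrative).
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;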

Pig Latin has a list of keywords that have a special meaning in the language and cannot be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands (cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX), all of which are covered in the following sections.

Pig Latin has mixed rules on case sensitivity. Operators and commands are not case-sensitive (to make interactive use more forgiving); however, aliases and function names are case-sensitive.
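For example, the LOAD operator may be written in either case, but the two aliases below name different relations (the file path is illustrative):

A = LOAD 'input/data';
a = load 'input/data';  -- a different relation from A, since aliases are case-sensitive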

Statements

As a Pig Latin program is executed, each statement is parsed in turn. If there are syntax errors or other (semantic) problems, such as undefined aliases, the interpreter will halt and display an error message. The interpreter builds a logical plan for every relational operation; relational operations form the core of a Pig Latin program. The logical plan for the statement is added to the logical plan for the program so far, and then the interpreter moves on to the next statement.

It’s important to note that no data processing takes place while the logical plan of the program is being constructed. For example, consider again the Pig Latin program from the first example:
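-- the maximum-temperature query (file path and quality values as in the sample weather data)
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;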

When the Pig Latin interpreter sees the first line containing the LOAD statement, it confirms that it is syntactically and semantically correct and adds it to the logical plan, but it does not load the data from the file (or even check whether the file exists). Indeed, where would it load it? Into memory? Even if it did fit into memory, what would it do with the data? Perhaps not all the input data is needed (since later statements filter it, for example), so it would be pointless to load it. The point is that it makes no sense to start any processing until the whole flow is defined. Similarly, Pig validates the GROUP and FOREACH...GENERATE statements, and adds them to the logical plan without executing them. The trigger for Pig to start execution is the DUMP statement. At that point, the logical plan is compiled into a physical plan and executed.

Multiquery execution

Since DUMP is a diagnostic tool, it will always trigger execution. However, the STORE command is different. In interactive mode, STORE acts like DUMP and will always trigger execution (this includes the run command), but in batch mode it will not (this includes the exec command). The reason for this is efficiency. In batch mode, Pig will parse the whole script to see if there are any optimizations that could be made to limit the amount of data to be written to or read from disk. Consider the following simple example:
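-- the file paths here are illustrative
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';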

Relations B and C are both derived from A, so to save reading A twice, Pig can run this script as a single MapReduce job by reading A once and writing two output files from the job, one for each of B and C. This feature is called multiquery execution.

In previous versions of Pig that did not have multiquery execution, each STORE statement in a script run in batch mode triggered execution, resulting in a job for each STORE statement. It is possible to restore the old behavior by disabling multiquery execution with the -M or -no_multiquery option to pig.

The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.

You can see the logical and physical plans created by Pig using the EXPLAIN command on a relation (EXPLAIN max_temp; for example).

EXPLAIN will also show the MapReduce plan, which shows how the physical operators are grouped into MapReduce jobs. This is a good way to find out how many MapReduce jobs Pig will run for your query.

The relational operators that can be a part of a logical plan in Pig are summarized in the following table. We shall go through the operators in more detail in “Data Processing Operators”.

Table: Pig Latin relational operators

Category                 Operator             Description
Loading and storing      LOAD                 Loads data from the filesystem or other storage into a relation
                         STORE                Saves a relation to the filesystem or other storage
                         DUMP                 Prints a relation to the console
Filtering                FILTER               Removes unwanted rows from a relation
                         DISTINCT             Removes duplicate rows from a relation
                         FOREACH...GENERATE   Adds or removes fields from a relation
                         STREAM               Transforms a relation using an external program
Grouping and joining     JOIN                 Joins two or more relations
                         COGROUP              Groups the data in two or more relations
                         GROUP                Groups the data in a single relation
                         CROSS                Creates the cross-product of two or more relations
Sorting                  ORDER                Sorts a relation by one or more fields
                         LIMIT                Limits the size of a relation to a maximum number of tuples
Combining and splitting  UNION                Combines two or more relations into one
                         SPLIT                Splits a relation into two or more relations

There are other types of statements that are not added to the logical plan. For example, the diagnostic operators DESCRIBE, EXPLAIN, and ILLUSTRATE are provided to allow the user to interact with the logical plan for debugging purposes (see the table below). DUMP is a sort of diagnostic operator, too, since it is used only to allow interactive debugging of small result sets or in combination with LIMIT to retrieve a few rows from a larger relation. The STORE statement should be used when the size of the output is more than a few lines, as it writes to a file rather than to the console.

Pig Latin provides two statements, REGISTER and DEFINE, to make it possible to incorporate user-defined functions into Pig scripts (see the table of UDF statements below).

Table: Pig Latin diagnostic operators

Operator     Description
DESCRIBE     Prints a relation’s schema
EXPLAIN      Prints the logical and physical plans
ILLUSTRATE   Shows a sample execution of the logical plan, using a generated subset of the input

Table: Pig Latin UDF statements

Statement   Description
REGISTER    Registers a JAR file with the Pig runtime
DEFINE      Creates an alias for a function, streaming script, or a command specification

Since they do not process relations, commands are not added to the logical plan; instead, they are executed immediately. Pig provides commands to interact with Hadoop filesystems (which are very handy for moving data around before or after processing with Pig) and MapReduce, as well as a few utility commands (described in the table of commands below).


The filesystem commands can operate on files or directories in any Hadoop filesystem, and they are very similar to the hadoop fs commands (which is not surprising, as both are simple wrappers around the Hadoop FileSystem interface). You can access all of the Hadoop filesystem shell commands using Pig’s fs command. For example, fs -ls will show a file listing, and fs -help will show help on all the available commands.

Table: Pig Latin commands

Category            Command        Description
Hadoop Filesystem   cat            Prints the contents of one or more files
                    cd             Changes the current directory
                    copyFromLocal  Copies a local file or directory to a Hadoop filesystem
                    copyToLocal    Copies a file or directory on a Hadoop filesystem to the local filesystem
                    cp             Copies a file or directory to another directory
                    fs             Accesses Hadoop’s filesystem shell
                    ls             Lists files
                    mkdir          Creates a new directory
                    mv             Moves a file or directory to another directory
                    pwd            Prints the path of the current working directory
                    rm             Deletes a file or directory
                    rmf            Forcibly deletes a file or directory (does not fail if it does not exist)
Hadoop MapReduce    kill           Kills a MapReduce job
Utility             exec           Runs a script in a new Grunt shell in batch mode
                    help           Shows the available commands and options
                    quit           Exits the interpreter
                    run            Runs a script within the existing Grunt shell
                    set            Sets Pig options and MapReduce job properties

Precisely which Hadoop filesystem is used is determined by the fs.default.name property in the site file for Hadoop Core. See “The Command-Line Interface” for more details on how to configure this property. These commands are mostly self-explanatory, except set, which is used to set options that control Pig’s behavior. The debug option is used to turn debug logging on or off from within a script (you can also control the log level when launching Pig, using the -d or -debug option):
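grunt> set debug on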

Another useful option is the job.name option, which gives a Pig job a meaningful name, making it easier to pick out your Pig MapReduce jobs when running on a shared Hadoop cluster. If Pig is running a script (rather than being an interactive query from Grunt), its job name defaults to a value based on the script name.
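For example (the job name itself is arbitrary):

grunt> set job.name 'max temperatures by year'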

There are two commands in the table for running a Pig script, exec and run. The difference is that exec runs the script in batch mode in a new Grunt shell, so any aliases defined in the script are not accessible to the shell after the script has completed. On the other hand, when running a script with run, it is as if the contents of the script had been entered manually, so the command history of the invoking shell contains all the statements from the script. Multiquery execution, where Pig executes a batch of statements in one go (see “Multiquery execution”), is used only by exec, not run.
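For example, with a hypothetical script max_temp.pig in the current directory, the two forms are invoked as follows; the first runs it in a fresh batch-mode shell, the second as if its statements had been typed at the prompt:

grunt> exec max_temp.pig
grunt> run max_temp.pig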

Expressions

An expression is something that is evaluated to yield a value. Expressions can be used in Pig as a part of a statement containing a relational operator. Pig has a rich variety of expressions, many of which will be familiar from other programming languages. They are listed in the table below, with brief descriptions and examples. We shall see examples of many of these expressions throughout the chapter.

Table: Pig Latin expressions

Types

So far you have seen some of the simple types in Pig, such as int and chararray. Here we will discuss Pig’s built-in types in more detail.

Pig has four numeric types: int, long, float, and double, which are identical to their Java counterparts. There is also a bytearray type, like Java’s byte array type for representing a blob of binary data, and chararray, which, like java.lang.String, represents textual data in UTF-16 format, although it can be loaded or stored in UTF-8 format.

Pig does not have types corresponding to Java’s boolean, byte, short, or char primitive types. These are all easily represented using Pig’s int type, or chararray for char. The numeric, textual, and binary types are simple atomic types. Pig Latin also has three complex types for representing nested structures: tuple, bag, and map. All of Pig Latin’s types are listed in the table below.

Table: Pig Latin types

Category   Type        Description                                                              Literal example
Numeric    int         32-bit signed integer                                                    1
           long        64-bit signed integer                                                    1L
           float       32-bit single-precision floating-point number                            1.0F
           double      64-bit double-precision floating-point number                            1.0
Text       chararray   Character array in UTF-16 format                                         'a'
Binary     bytearray   Byte array                                                               Not supported
Complex    tuple       Sequence of fields of any type                                           (1,'pomegranate')
           bag         Unordered collection of tuples, possibly with duplicates                 {(1,'pomegranate'),(2)}
           map         Set of key-value pairs; keys are character arrays, values are any type   ['a'#'pomegranate']

The complex types are usually loaded from files or constructed using relational operators. Be aware, however, that the literal form shown in the table is used when a constant value is created from within a Pig Latin program. The raw form in a file is usually different when using the standard PigStorage loader. For example, the representation in a file of the bag in the table would be {(1,pomegranate),(2)} (note the lack of quotes), and with a suitable schema, this would be loaded as a relation with a single field and row, whose value was the bag.

Maps are always loaded from files, since there is no relational operator in Pig that produces a map. It’s possible to write a UDF to generate maps, if desired. Although relations and bags are conceptually the same (an unordered collection of tuples), in practice Pig treats them slightly differently. A relation is a top-level construct, whereas a bag has to be contained in a relation. Normally, you don’t have to worry about this, but there are a few restrictions that can trip up the uninitiated. For example, it’s not possible to create a relation from a bag literal. So the following statement fails:
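grunt> A = {(1,2),(3,4)};  -- syntax error: a relation cannot be defined by a bag literal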

The simplest workaround in this case is to load the data from a file using the LOAD statement. As another example, you can’t treat a relation like a bag and project a field into a new relation ($0 refers to the first field of A, using the positional notation):
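grunt> B = A.$0;  -- fails: a relation cannot be dereferenced like a bag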

Instead, you have to use a relational operator to turn the relation A into relation B:
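grunt> B = FOREACH A GENERATE $0;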

It’s possible that a future version of Pig Latin will remove these inconsistencies and treat relations and bags in the same way.

Schemas

A relation in Pig may have an associated schema, which gives the fields in the relation names and types. We’ve seen how an AS clause in a LOAD statement is used to attach a schema to a relation:
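-- assuming the tab-delimited sample weather file used earlier
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}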

This time we’ve declared the year to be an integer, rather than a chararray, even though the file it is being loaded from is the same. An integer may be more appropriate if we need to manipulate the year arithmetically (to turn it into a timestamp, for example), whereas the chararray representation might be more appropriate when it’s being used as a simple identifier. Pig’s flexibility in the degree to which schemas are declared contrasts with schemas in traditional SQL databases, which are declared before the data is loaded into the system. Pig is designed for analyzing plain input files with no associated type information, so it is quite natural to choose types for fields later than you would with an RDBMS.

It’s possible to omit type declarations completely, too:
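grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}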

In this case, we have specified only the names of the fields in the schema: year, temperature, and quality. The types default to bytearray, the most general type, representing a binary string.

You don’t need to specify types for every field; you can leave some to default to bytearray, as we have done for year in this declaration:
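grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>>     AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}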

However, if you specify a schema in this way, you do need to specify every field. Also, there’s no way to specify the type of a field without specifying the name. On the other hand, the schema is entirely optional and can be omitted by not specifying an AS clause:

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.

Fields in a relation with no schema can be referenced only using positional notation: $0 refers to the first field in a relation, $1 to the second, and so on. Their types default to bytearray:
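grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}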

Although it can be convenient not to have to assign types to fields (particularly in the first stages of writing a query), doing so can improve the clarity and efficiency of Pig Latin programs, and is generally recommended.

Declaring a schema as a part of the query is flexible, but doesn’t lend itself to schema reuse. A set of Pig queries over the same input data will often have the same schema repeated in each query. If the query processes a large number of fields, this repetition can become hard to maintain, since Pig (unlike Hive) doesn’t have a way to associate a schema with data outside of a query. One way to solve this problem is to write your own load function, which encapsulates the schema. This is described in more detail in “A Load UDF”.

Validation and nulls

An SQL database will enforce the constraints in a table’s schema at load time: for example, trying to load a string into a column that is declared to be a numeric type will fail. In Pig, if the value cannot be cast to the type declared in the schema, then it will substitute a null value. Let’s see how this works if we have the following input for the weather data, which has an “e” character in place of an integer:
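1950    0       1
1950    22      1
1950    e       1
1949    111     1
1949    78      1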

Pig handles the corrupt line by producing a null for the offending value, which is displayed as the absence of a value when dumped to screen (and also when saved using STORE):
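-- the corrupt file path here mirrors the sample file used earlier
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>>     AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)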

Pig produces a warning for the invalid field (not shown here), but does not halt its processing. For large datasets, it is very common to have corrupt, invalid, or merely unexpected data, and it is generally infeasible to incrementally fix every unparsable record. Instead, we can pull out all of the invalid records in one go, so we can take action on them, perhaps by fixing our program (because they indicate we have made a mistake) or by filtering them out (because the data is genuinely unusable):
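grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)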

Note the use of the is null operator, which is analogous to SQL. In practice, we would include more information from the original record, such as an identifier and the value that could not be parsed, to help our analysis of the bad data. We can find the number of corrupt records using the following idiom for counting the number of rows in a relation:

grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
(all,1L)

Another useful technique is to use the SPLIT operator to partition the data into “good” and “bad” relations, which can then be analyzed separately:
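grunt> SPLIT records INTO good_records IF temperature is not null,
>>     bad_records IF temperature is null;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad_records;
(1950,,1)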

Going back to the case in which temperature’s type was left undeclared, the corrupt data cannot be easily detected, since it doesn’t surface as a null:
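grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>>     AS (year:chararray, temperature, quality:int);
grunt> grouped_records = GROUP records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
>>     MAX(records.temperature);
grunt> DUMP max_temp;
(1949,111.0)
(1950,22.0)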

What happens in this case is that the temperature field is interpreted as a bytearray, so the corrupt field is not detected when the input is loaded. When passed to the MAX function, the temperature field is cast to a double, since MAX works only with numeric types. The corrupt field cannot be represented as a double, so it becomes a null, which MAX silently ignores. The best approach is generally to declare types for your data on loading, and look for missing or corrupt values in the relations themselves before you do your main processing.

Sometimes corrupt data shows up as smaller tuples since fields are simply missing. You can filter these out by using the SIZE function as follows:
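-- the file path is illustrative; newer Pig releases may require SIZE(TOTUPLE(*)) in place of SIZE(*)
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3)
(1,Scarf)
grunt> B = FILTER A BY SIZE(*) > 1;
grunt> DUMP B;
(2,Tie)
(4,Coat)
(1,Scarf)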

Schema merging

In Pig, you don’t declare the schema for every new relation in the data flow. In most cases, Pig can figure out the resulting schema for the output of a relational operation by considering the schema of the input relation.

How are schemas propagated to new relations? Some relational operators don’t change the schema, so the relation produced by the LIMIT operator (which restricts a relation to a maximum number of tuples), for example, has the same schema as the relation it operates on. For other operators, the situation is more complicated. UNION, for example, combines two or more relations into one and tries to merge the input relations’ schemas. If the schemas are incompatible, due to different types or numbers of fields, then the schema of the result of the UNION is unknown.
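A minimal sketch of this behavior (the paths and schemas here are illustrative):

grunt> A = LOAD 'input/pig/schemas/A' AS (fruit:chararray, n:int);
grunt> B = LOAD 'input/pig/schemas/B' AS (fruit:chararray, n:int, price:double);
grunt> C = UNION A, B;
grunt> DESCRIBE C;
Schema for C unknown.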

You can find out the schema for any relation in the data flow using the DESCRIBE operator. If you want to redefine the schema for a relation, you can use the FOREACH...GENERATE operator with AS clauses to define the schema for some or all of the fields of the input relation. See “User-Defined Functions” for further discussion of schemas.

Functions

Functions in Pig come in four types:

Eval function

A function that takes one or more expressions and returns another expression. An example of a built-in eval function is MAX, which returns the maximum value of the entries in a bag. Some eval functions are aggregate functions, which means they operate on a bag of data to produce a scalar value; MAX is an example of an aggregate function. Furthermore, many aggregate functions are algebraic, which means that the result of the function may be calculated incrementally. In MapReduce terms, algebraic functions make use of the combiner and are much more efficient to calculate (see “Combiner Functions” ). MAX is an algebraic function, whereas a function to calculate the median of a collection of values is an example of a function that is not algebraic.

Filter function

A special type of eval function that returns a logical boolean result. As the name suggests, filter functions are used in the FILTER operator to remove unwanted rows. They can also be used in other relational operators that take boolean conditions and, in general, expressions using boolean or conditional expressions. An example of a built-in filter function is IsEmpty, which tests whether a bag or a map contains any items.

Load function

A function that specifies how to load data into a relation from external storage.

Store function

A function that specifies how to save the contents of a relation to external storage. Often, load and store functions are implemented by the same type. For example, PigStorage, which loads data from delimited text files, can store data in the same format.
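For example, PigStorage can load tab-delimited text and store the same relation with a different delimiter (the paths are illustrative):

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt' USING PigStorage('\t')
>>     AS (year:chararray, temperature:int, quality:int);
grunt> STORE records INTO 'out' USING PigStorage(':');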

Pig has a small collection of built-in functions, which are listed in the table below.

Table: Pig built-in functions

If the function you need is not available, you can write your own. Before you do that, however, have a look in the Piggy Bank, a repository of Pig functions shared by the Pig community. There are details on the Pig wiki at http://wiki.apache.org/pig/PiggyBank on how to browse and obtain the Piggy Bank functions. If the Piggy Bank doesn’t have what you need, you can write your own function (and if it is sufficiently general, you might consider contributing it to the Piggy Bank so that others can benefit from it, too). These are known as user-defined functions, or UDFs.

