An Example: Writing a Pig Latin Program

Let’s look at a simple example by writing a program in Pig Latin to calculate the maximum recorded temperature by year for the weather dataset (just as we did using MapReduce in the MapReduce chapter). The complete program is only a few lines long:
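A sketch of the complete script follows; the input path is a placeholder, and the set of acceptable quality codes in the FILTER line is an assumption, so substitute whatever applies to your copy of the dataset:

    -- max_temp.pig: finds the maximum recorded temperature by year
    records = LOAD 'input/ncdc/micro-tab/sample.txt'
        AS (year:chararray, temperature:int, quality:int);
    -- 9999 indicates a missing reading; the quality codes kept here are assumed.
    -- The IN operator needs Pig 0.12+; on older versions, chain OR'd equality tests.
    filtered_records = FILTER records BY temperature != 9999 AND
        quality IN (0, 1, 4, 5, 9);
    grouped_records = GROUP filtered_records BY year;
    max_temp = FOREACH grouped_records GENERATE group,
        MAX(filtered_records.temperature);
    DUMP max_temp;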

To explore what’s going on, we’ll use Pig’s Grunt interpreter, which allows us to enter lines and interact with the program to understand what it’s doing. Start up Grunt in local mode, then enter the first line of the Pig script:
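A sketch of that first step, reusing the placeholder path from the script above (% is the shell prompt, grunt> is Grunt, and >> is Grunt’s continuation prompt):

    % pig -x local
    grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
    >>     AS (year:chararray, temperature:int, quality:int);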

For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields. (Pig actually has more flexibility than this with regard to the input formats it accepts, as you’ll see later.) This line describes the input data we want to process. The year:chararray notation describes the field’s name and type; a chararray is like a Java String, and an int is like a Java int. The LOAD operator takes a URI argument; here we are just using a local file, but we could refer to an HDFS URI. The AS clause (which is optional) gives the fields names, making it convenient to refer to them in subsequent statements.

The result of the LOAD operator, and indeed of any operator in Pig Latin, is a relation, which is just a set of tuples. A tuple is just like a row of data in a database table, with multiple fields in a particular order. In this example, the LOAD function produces a set of (year, temperature, quality) tuples that are present in the input file. We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses, as the DUMP output below shows.

Relations are given names, or aliases, so they can be referred to. This relation is given the records alias. We can examine the contents of an alias using the DUMP operator:
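For instance, with a hypothetical five-line sample file covering two years, the session might look like this (the values are illustrative):

    grunt> DUMP records;
    (1950,0,1)
    (1950,22,1)
    (1950,-11,1)
    (1949,111,1)
    (1949,78,1)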

We can also see the structure of a relation (the relation’s schema) using the DESCRIBE operator on the relation’s alias:
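For the records relation this prints a line along these lines (the exact formatting varies slightly between Pig versions):

    grunt> DESCRIBE records;
    records: {year: chararray,temperature: int,quality: int}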

This tells us that records has three fields, with aliases year, temperature, and quality, which are the names we gave them in the AS clause. The fields have the types given to them in the AS clause, too. We shall examine types in Pig in more detail later.

The second statement removes records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality reading. For this small dataset, no records are filtered out:
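A sketch of the statement; as in the full script, the set of accepted quality codes is an assumption, and the dumped tuples are the same illustrative sample as before:

    grunt> filtered_records = FILTER records BY temperature != 9999 AND
    >>     quality IN (0, 1, 4, 5, 9);
    grunt> DUMP filtered_records;
    (1950,0,1)
    (1950,22,1)
    (1950,-11,1)
    (1949,111,1)
    (1949,78,1)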

The third statement uses the GROUP operator to group the filtered_records relation by the year field. Let’s use DUMP to see what it produces:
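With the illustrative sample data, the grouped relation comes out like this (the ordering of tuples within each bag is not guaranteed):

    grunt> grouped_records = GROUP filtered_records BY year;
    grunt> DUMP grouped_records;
    (1949,{(1949,111,1),(1949,78,1)})
    (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})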

We now have two rows, or tuples, one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field is a bag of tuples for that year. A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces.

By grouping the data in this way, we have created a row per year, so now all that remains is to find the maximum temperature for the tuples in each bag. Before we do this, let’s understand the structure of the grouped_records relation:
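Running DESCRIBE on the alias shows the two-field structure:

    grunt> DESCRIBE grouped_records;
    grouped_records: {group: chararray,filtered_records: {year: chararray,temperature: int,quality: int}}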

This tells us that the grouping field is given the alias group by Pig, and that the second field has the same structure as the filtered_records relation that was being grouped. With this information, we can try the fourth transformation:
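Expressed in Grunt, the statement reads (every name here comes from the relations built above):

    grunt> max_temp = FOREACH grouped_records GENERATE group,
    >>     MAX(filtered_records.temperature);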

FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. In this example, the first field is group, which is just the year. The second field is a little more complex.

The filtered_records.temperature reference is to the temperature field of the filtered_records bag in the grouped_records relation. MAX is a built-in function for calculating the maximum value of fields in a bag. In this case, it calculates the maximum temperature for the fields in each filtered_records bag. Let’s check the result:
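With the illustrative sample, the result is one (year, maximum temperature) tuple per year:

    grunt> DUMP max_temp;
    (1949,111)
    (1950,22)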

So we’ve successfully calculated the maximum temperature for each year.

Generating Examples

In this example, we’ve used a small sample dataset with just a handful of rows to make it easier to follow the data flow and aid debugging. Creating a cut-down dataset is an art, as ideally it should be rich enough to cover all the cases needed to exercise your queries (the completeness property), yet be small enough for the programmer to reason about (the conciseness property). Using a random sample doesn’t work well in general, since join and filter operations tend to remove all random data, leaving an empty result, which is not illustrative of the general flow.

With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably complete and concise dataset. Although it can’t generate examples for all queries (it doesn’t support LIMIT, SPLIT, or nested FOREACH statements, for example), it can generate useful examples for many queries. ILLUSTRATE works only if the relation has a schema.

Here is the output from running ILLUSTRATE (slightly reformatted to fit the page):
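A sketch of such a run over the illustrative sample; the table layout and the rows Pig synthesizes vary between runs and Pig versions, so treat the values as approximate:

    grunt> ILLUSTRATE max_temp;
    ---------------------------------------------------------------------
    | records          | year:chararray  | temperature:int | quality:int |
    ---------------------------------------------------------------------
    |                  | 1949            | 78              | 1           |
    |                  | 1949            | 9999            | 9999        |
    ---------------------------------------------------------------------
    | filtered_records | year:chararray  | temperature:int | quality:int |
    ---------------------------------------------------------------------
    |                  | 1949            | 78              | 1           |
    ---------------------------------------------------------------------
    | grouped_records  | group:chararray | filtered_records:bag          |
    ---------------------------------------------------------------------
    |                  | 1949            | {(1949, 78, 1)}               |
    ---------------------------------------------------------------------
    | max_temp         | group:chararray | :int                          |
    ---------------------------------------------------------------------
    |                  | 1949            | 78                            |
    ---------------------------------------------------------------------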

Notice that Pig used some of the original data (this is important to keep the generated dataset realistic), as well as creating some new data. It noticed the special value 9999 in the query and created a tuple containing this value to exercise the FILTER statement.

In summary, the output of ILLUSTRATE is easy to follow and can help you understand what your query is doing.

