Pig in Practice

This section covers some practical techniques that are worth knowing when you are developing and running Pig programs.

Parallelism

When running in MapReduce mode, you need to tell Pig how many reducers you want for each job. You do this with a PARALLEL clause on operators that run in the reduce phase: all the grouping and joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and ORDER. By default, the number of reducers is one (just as for MapReduce), so it is important to set the degree of parallelism when running on a large dataset. For example, adding PARALLEL 30 to a GROUP statement sets the number of reducers for that job to 30.
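A minimal sketch of the clause in context, written as a shell here-document that saves a short Pig script. The file name group_parallel.pig, the relation names, the input path, and the schema are assumptions made for illustration; only the PARALLEL 30 clause on the GROUP comes from the text above.

```shell
# Save a small Pig script. The quoted 'EOF' keeps the shell from
# expanding anything inside the here-document.
# File name, relation names, path, and schema are illustrative assumptions.
cat > group_parallel.pig <<'EOF'
records = LOAD 'input/sample.txt'
  AS (year:chararray, temperature:int, quality:int);
-- PARALLEL 30 requests 30 reducers for the job that runs this GROUP
grouped_records = GROUP records BY year PARALLEL 30;
EOF
```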

A good setting for the number of reduce tasks is slightly fewer than the number of reduce slots in the cluster. See “Choosing the Number of Reducers” for further discussion.

The number of map tasks is set by the size of the input (with one map per HDFS block) and is not affected by the PARALLEL clause.

Parameter Substitution

If you have a Pig script that you run on a regular basis, you will often want to run the same script with different parameters. For example, a script that runs daily may use the date to determine which input files it runs over. Pig supports parameter substitution, where parameters in the script are substituted with values supplied at runtime. Parameters are denoted by identifiers prefixed with a $ character; for example, a script can use $input and $output to specify its input and output paths.
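As a sketch of what such a script might look like, the following saves a parameterized Pig script from the shell. The file name max_temp_param.pig matches the script name used later in this section, but the relation names, the schema, and the processing steps are assumptions made for illustration, not the original listing.

```shell
# The quoted 'EOF' stops the shell from expanding $input and $output,
# so both placeholders reach Pig unchanged.
# Relation names, schema, and processing steps are illustrative assumptions.
cat > max_temp_param.pig <<'EOF'
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
grouped_records = GROUP records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(records.temperature);
STORE max_temp INTO '$output';
EOF
```

Writing the script with an unquoted here-document delimiter would let the shell replace $input and $output with (probably empty) shell variables before Pig ever saw them, which is why the delimiter is quoted here.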

Parameters can be specified when launching Pig by passing one -param option per parameter, each in the form name=value.
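A sketch of such an invocation follows. The input path and script path are the ones used in the dynamic-parameter example later in this section; /tmp/out is an assumed output directory. The command is printed first and only run where a pig launcher is available.

```shell
# Input and script paths are taken from this section; /tmp/out is assumed.
input=/user/tom/input/ncdc/micro-tab/sample.txt
output=/tmp/out
script=ch11/src/main/pig/max_temp_param.pig

# One -param option per parameter, then the script to run.
set -- -param "input=$input" -param "output=$output" "$script"
echo "pig $*"

# Launch Pig only where the pig command is on the PATH.
if command -v pig >/dev/null 2>&1; then
  pig "$@"
fi
```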

You can also put parameters in a file and pass them to Pig using the -param_file option. For example, we can achieve the same result as passing input and output with -param by placing the two definitions in a file, one name=value definition per line.
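A sketch of such a parameter file, created from the shell. The file name max_temp_param.param and the /tmp/out output directory are assumptions; the input path is the one used elsewhere in this section.

```shell
# Write a parameter file with one name=value definition per line.
# Lines starting with # are comments in Pig parameter files.
# The file name and the output directory are illustrative assumptions.
cat > max_temp_param.param <<'EOF'
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output directory
output=/tmp/out
EOF
```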

The pig invocation then names that file with a single -param_file option in place of the individual -param options.
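The invocation might look like the following sketch. The location of the parameter file (alongside the script, with a .param extension) is an assumption.

```shell
# Parameter file location is an assumption; the script path is from this section.
param_file=ch11/src/main/pig/max_temp_param.param
script=ch11/src/main/pig/max_temp_param.pig

# A single -param_file option replaces the individual -param options.
set -- -param_file "$param_file" "$script"
echo "pig $*"

# Launch Pig only where the pig command is on the PATH.
if command -v pig >/dev/null 2>&1; then
  pig "$@"
fi
```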

You can specify multiple parameter files using -param_file repeatedly. You can also use a combination of -param and -param_file options, and if any parameter is defined in both a parameter file and on the command line, the last value on the command line takes precedence.
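To illustrate the combination, the sketch below passes a parameter file and then overrides one of its values on the command line. The parameter file path and the /tmp/override directory are hypothetical; by the rule just described, the command-line value of output is the one Pig would use.

```shell
# Hypothetical parameter file that defines both input and output.
param_file=ch11/src/main/pig/max_temp_param.param
script=ch11/src/main/pig/max_temp_param.pig

# output is defined in the file and again with -param;
# the command-line definition takes precedence.
set -- -param_file "$param_file" -param output=/tmp/override "$script"
echo "pig $*"

# Launch Pig only where the pig command is on the PATH.
if command -v pig >/dev/null 2>&1; then
  pig "$@"
fi
```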

Dynamic parameters

For parameters that are supplied using the -param option, it is easy to make the value dynamic by running a command or script. Many Unix shells support command substitution for a command enclosed in backticks, and we can use this to make the output directory date-based:

% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
>     -param output=/tmp/`date "+%Y-%m-%d"`/out \
>     ch11/src/main/pig/max_temp_param.pig

Pig also supports backticks in parameter files, by executing the enclosed command in a shell and using the shell output as the substituted value. If the command or script exits with a nonzero exit status, then the error message is reported and execution halts.
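A sketch of a parameter file that uses a backticked command. The file name max_temp_dynamic.param is hypothetical. The whole value is produced by the command here, because this sketch does not assume that a backticked command can be mixed with literal text inside a parameter file.

```shell
# The quoted 'EOF' keeps the shell from running the backticked command now;
# it is written to the file literally, for Pig to execute later.
# The file name is a hypothetical choice.
cat > max_temp_dynamic.param <<'EOF'
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Pig runs the backticked command and uses its output as the value
output=`date "+/tmp/%Y-%m-%d/out"`
EOF

# Preview what the command prints when run directly in the shell.
preview=$(date "+/tmp/%Y-%m-%d/out")
echo "$preview"
```

Running the command by hand first, as in the preview line, is a cheap way to catch a nonzero exit status before Pig does.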

Backtick support in parameter files is a useful feature; it means that parameters can be defined in the same way whether they are defined in a file or on the command line.

Parameter substitution processing

Parameter substitution occurs as a preprocessing step before the script is run. You can see the substitutions that the preprocessor made by executing Pig with the -dryrun option. In dry run mode, Pig performs parameter substitution and generates a copy of the original script with substituted values, but does not execute the script. You can inspect the generated script and check that the substitutions look sane (because they are dynamically generated, for example) before running it in normal mode. At the time of this writing, Grunt does not support parameter substitution.
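A dry run might be launched as in the following sketch, which adds -dryrun to a date-based invocation like the one shown earlier. The exact name and location of the generated copy are not stated in this section; Pig's documentation describes a file with a .substituted suffix, but treat that as something to check against your version.

```shell
# Paths are taken from this section; the date-based output mirrors the
# dynamic-parameter example.
input=/user/tom/input/ncdc/micro-tab/sample.txt
output=/tmp/$(date "+%Y-%m-%d")/out
script=ch11/src/main/pig/max_temp_param.pig

# -dryrun asks Pig to substitute parameters without executing the script.
set -- -dryrun -param "input=$input" -param "output=$output" "$script"
echo "pig $*"

# Launch Pig only where the pig command is on the PATH.
if command -v pig >/dev/null 2>&1; then
  pig "$@"
fi
```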


