Installing and Running Pig

Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is nothing extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop filesystems) from your workstation.

Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need Cygwin). Download a stable release from http://hadoop.apache.org/pig/releases.html, and unpack the tarball in a suitable place on your workstation:
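(The archive name below is a placeholder; substitute the version you actually downloaded for x.y.z.)

    % tar xzf pig-x.y.z.tar.gz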

It’s convenient to add Pig’s binary directory to your command-line path. For example:
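(This sketch assumes the tarball was unpacked in your home directory; PIG_INSTALL is just a convenient name for the resulting directory.)

    % export PIG_INSTALL=~/pig-x.y.z
    % export PATH=$PATH:$PIG_INSTALL/bin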

You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.
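For example (the path is only illustrative; point it at whichever Java installation you want Pig to use):

    % export JAVA_HOME=/usr/lib/jvm/java-6-sun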

Try typing pig -help to get usage instructions.

Execution Types

Pig has two execution types or modes: local mode and MapReduce mode.

Local mode

In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is suitable only for small datasets and when trying out Pig.

The execution type is set using the -x or -exectype option. To run in local mode, set the option to local:
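(A typical invocation; the startup messages Pig prints before the prompt are omitted here.)

    % pig -x local
    grunt>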

This starts Grunt, the Pig interactive shell, which is discussed in more detail shortly.

MapReduce mode

In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce mode (with a fully distributed cluster) is what you use when you want to run Pig on large datasets.

To use MapReduce mode, you first need to check that the version of Pig you downloaded is compatible with the version of Hadoop you are using. Pig releases will only work against particular versions of Hadoop; this is documented on the releases page. For example, Pig 0.3 and 0.4 run against a Hadoop 0.18.x release, while Pig 0.5 to 0.7 work with Hadoop 0.20.x.

If a Pig release supports multiple versions of Hadoop, you can use the environment variable PIG_HADOOP_VERSION to tell Pig the version of Hadoop it is connecting to. For example, the following makes Pig use any 0.18.x version of Hadoop:
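(The value shown is the minor version number; check the documentation for your release to confirm the exact values it accepts.)

    % export PIG_HADOOP_VERSION=18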

Next, you need to point Pig at the cluster’s namenode and jobtracker. If you already have a Hadoop site file (or files) that define fs.default.name and mapred.job.tracker, you can simply add Hadoop’s configuration directory to Pig’s classpath:
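(HADOOP_INSTALL below stands in for wherever Hadoop is installed on your machine.)

    % export PIG_CLASSPATH=$HADOOP_INSTALL/conf/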

Alternatively, you can set these two properties in the pig.properties file in Pig’s conf directory. Here’s an example for a pseudo-distributed setup:
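(The hostname and jobtracker port are illustrative; use the values from your own Hadoop configuration.)

    fs.default.name=hdfs://localhost/
    mapred.job.tracker=localhost:8021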

Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig, setting the -x option to mapreduce, or omitting it entirely, as MapReduce mode is the default:
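(A sketch of a typical session; the exact log lines and connection details depend on your configuration.)

    % pig
    ... Connecting to hadoop file system at: hdfs://localhost/
    ... Connecting to map-reduce job tracker at: localhost:8021
    grunt>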

As you can see from the output, Pig reports the filesystem and jobtracker that it has connected to.

Running Pig Programs

There are three ways of executing Pig programs, all of which work in both local and MapReduce mode:

Script

Pig can run a script file that contains Pig commands. For example, pig script.pig runs the commands in the local file script.pig. Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on the command line.
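For instance, both of the following are plausible invocations (the script name and the path inside the -e string are hypothetical):

    % pig script.pig
    % pig -e "records = LOAD 'input/sample.txt'; DUMP records;"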

Grunt

Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run, and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.

Embedded

You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.

There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig.
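As a rough sketch of what embedding looks like, using Pig's PigServer class (the script and the file paths are purely illustrative):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class EmbeddedPigExample {
        public static void main(String[] args) throws Exception {
            // Run Pig from Java in local mode; use ExecType.MAPREDUCE for a cluster
            PigServer pigServer = new PigServer(ExecType.LOCAL);

            // Register a Pig Latin statement (the input path is only an example)
            pigServer.registerQuery(
                "records = LOAD 'input/sample.txt' AS (line:chararray);");

            // Write the relation out to a directory called 'output'
            pigServer.store("records", "output");
        }
    }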

Grunt

Grunt has line-editing facilities like those found in GNU Readline (used in the bash shell and many other command-line applications). For instance, the Ctrl-E key combination will move the cursor to the end of the line. Grunt remembers command history, too (it is stored in a file called .pig_history in your home directory), and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for previous and next) or, equivalently, the up or down cursor keys.

Another handy feature is Grunt’s completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key. For example, consider the following incomplete line:
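(A hypothetical line typed at the Grunt prompt, with b standing in for a previously defined relation:)

    grunt> a = foreach b ge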


If you press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
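(Continuing the illustrative line from above:)

    grunt> a = foreach b generate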

You can customize the completion tokens by creating a file named autocomplete and placing it on Pig’s classpath (such as in the conf directory in Pig’s install directory) or in the directory you invoked Grunt from. The file should have one token per line, and tokens must not contain any whitespace. Matching is case-sensitive. It can be very handy to add commonly used file paths (especially because Pig does not perform filename completion) or the names of any user-defined functions you have created.
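For instance, an autocomplete file might contain a handful of tokens like these (the entries are purely illustrative):

    /data/input
    /data/output
    MyUDF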

You can get a list of commands using the help command. When you’ve finished your Grunt session, you can exit with the quit command.

Pig Latin Editors

PigPen is an Eclipse plug-in that provides an environment for developing Pig programs. It includes a Pig script text editor, an example generator (equivalent to the ILLUSTRATE command), and a button for running the script on a Hadoop cluster. There is also an operator graph window, which shows a script in graph form, for visualizing the data flow. For full installation and usage instructions, please refer to the Pig wiki at http://wiki.apache.org/pig/PigPen.

There are also Pig Latin syntax highlighters for other editors, including Vim and TextMate. Details are available on the Pig wiki.

