Configuring the Development Environment

The first step is to download the version of Hadoop that you plan to use and unpack it on your development machine (this is described in Appendix A). Then, in your favorite IDE, create a new project and add all the JAR files from the top level of the unpacked distribution and from the lib directory to the classpath. You will then be able to compile Java Hadoop programs and run them in local (standalone) mode within the IDE.

For Eclipse users, there is a plug-in available for browsing HDFS and launching MapReduce programs. Instructions are available on the Hadoop wiki at http://wiki.apache.org/hadoop/EclipsePlugIn. Alternatively, Karmasphere provides Eclipse and NetBeans plug-ins for developing and running MapReduce jobs and browsing Hadoop clusters.

Managing Configuration

When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster. In fact, you may have several clusters you work with, or you may have a local “pseudo-distributed” cluster that you like to test on (a pseudo-distributed cluster is one whose daemons all run on the local machine; setting up this mode is covered in Appendix A, too).

One way to accommodate these variations is to have Hadoop configuration files containing the connection settings for each cluster you run against, and specify which one you are using when you run Hadoop applications or tools. As a matter of best practice, it’s recommended to keep these files outside Hadoop’s installation directory tree, as this makes it easy to switch between Hadoop versions without duplicating or losing settings.

For the purposes of this book, we assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml (these are available in the example code for this book). Note that there is nothing special about the names of these files—they are just convenient ways to package up some configuration settings. (Compare this to the table in Appendix A, which sets out the equivalent server-side configurations.)

The hadoop-local.xml file contains the default Hadoop configuration for the default filesystem and the jobtracker:
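(A sketch of what this file might contain, using the classic Hadoop 1 property names fs.default.name and mapred.job.tracker; newer releases use fs.defaultFS and the YARN equivalents, so check the names against your version.)

<?xml version="1.0"?>
<configuration>

  <!-- Use the local filesystem as the default filesystem -->
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

  <!-- Run MapReduce jobs in-process rather than against a jobtracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>

</configuration>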

The settings in hadoop-localhost.xml point to a namenode and a jobtracker both running on localhost:
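(Again a sketch under the same assumptions; 8021 is the conventional jobtracker port, but substitute whatever your pseudo-distributed setup uses.)

<?xml version="1.0"?>
<configuration>

  <!-- Namenode running on the local machine -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>

  <!-- Jobtracker running on the local machine -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

</configuration>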

Finally, hadoop-cluster.xml contains details of the cluster’s namenode and jobtracker addresses. In practice, you would name the file after the name of the cluster, rather than “cluster” as we have here:
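(A sketch; the hostnames namenode and jobtracker below are placeholders for the actual machines in your cluster.)

<?xml version="1.0"?>
<configuration>

  <!-- The cluster’s namenode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
  </property>

  <!-- The cluster’s jobtracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
  </property>

</configuration>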

You can add other configuration properties to these files as needed. For example, if you wanted to set your Hadoop username for a particular cluster, you could do it in the appropriate file.

Setting User Identity

The user identity that Hadoop uses for permissions in HDFS is determined by running the whoami command on the client system. Similarly, the group names are derived from the output of running groups.

If, however, your Hadoop user identity is different from the name of your user account on your client machine, then you can explicitly set your Hadoop username and group names by setting the hadoop.job.ugi property. The username and group names are specified as a comma-separated list of strings (e.g., preston,directors,inventors would set the username to preston and the group names to directors and inventors).
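In one of the per-cluster configuration files described above, that would look something like this (using the illustrative names from the preceding paragraph):

  <!-- Username first, then group names, as a comma-separated list -->
  <property>
    <name>hadoop.job.ugi</name>
    <value>preston,directors,inventors</value>
  </property>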

You can set the user identity that the HDFS web interface runs as by setting dfs.web.ugi using the same syntax. By default, it is webuser,webgroup, which is not a super user, so system files are not accessible through the web interface.

Notice that, by default, there is no authentication with this system. See “Security” for how to use Kerberos authentication with Hadoop.

With this setup, it is easy to use any configuration with the -conf command-line switch. For example, the following command shows a directory listing on the HDFS server running in pseudo-distributed mode on localhost:
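(Assuming the conf directory described earlier is in the current working directory:)

% hadoop fs -conf conf/hadoop-localhost.xml -ls .

The files listed will, of course, depend on what is in your HDFS home directory.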

If you omit the -conf option, then you pick up the Hadoop configuration in the conf subdirectory under $HADOOP_INSTALL. Depending on how you set this up, this may be for a standalone setup or a pseudo-distributed cluster. Tools that come with Hadoop support the -conf option, but it’s also straightforward to make your programs (such as programs that run MapReduce jobs) support it, too, using the Tool interface.

GenericOptionsParser, Tool, and ToolRunner

Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more convenient to implement the Tool interface and run your application with ToolRunner, which uses GenericOptionsParser internally:
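(The Tool interface, in the org.apache.hadoop.util package, is essentially just the following:)

public interface Tool extends Configurable {
  int run(String[] args) throws Exception;
}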

The example below shows a very simple implementation of Tool, for printing the keys and values of all the properties in the Tool’s Configuration object.

An example Tool implementation for printing the properties in a Configuration
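(This is a sketch reconstructed from the description that follows; the version in the book’s example code may differ in minor details.)

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  static {
    // Pick up the HDFS and MapReduce resources in addition to the core ones
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // Print every property in the configuration as key=value
    for (Entry<String, String> entry : conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner builds the Configuration, applies any generic options, then calls run()
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}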

We make ConfigurationPrinter a subclass of Configured, which is an implementation of the Configurable interface. All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this. The run() method obtains the Configuration using Configurable’s getConf() method and then iterates over it, printing each property to standard output.

The static block makes sure that the HDFS and MapReduce configurations are picked up in addition to the core ones (which Configuration knows about already).

ConfigurationPrinter’s main() method does not invoke its own run() method directly. Instead, we call ToolRunner’s static run() method, which takes care of creating a Configuration object for the Tool, before calling its run() method. ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and set them on the Configuration instance. We can see the effect of picking up the properties specified in conf/hadoop-localhost.xml by running the following command:
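(Assuming ConfigurationPrinter has been compiled and placed on the Hadoop classpath, and using the hadoop-localhost.xml sketched earlier; the exact output depends on your settings.)

% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml | grep mapred.job.tracker=
mapred.job.tracker=localhost:8021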

Which Properties Can I Set?

ConfigurationPrinter is a useful tool for telling you what a property is set to in your environment.

You can also see the default settings for all the public properties in Hadoop by looking in the docs directory of your Hadoop installation for HTML files called core-default.html, hdfs-default.html, and mapred-default.html. Each property has a description that explains what it is for and what values it can be set to.

Be aware that some properties have no effect when set in the client configuration. For example, if in your job submission you set mapred.tasktracker.map.tasks.maximum with the expectation that it would change the number of task slots for the tasktrackers running your job, then you would be disappointed, since this property is honored only if set in the tasktracker’s mapred-site.xml file. In general, you can tell the component where a property should be set by its name, so the fact that mapred.tasktracker.map.tasks.maximum starts with mapred.tasktracker gives you a clue that it can be set only for the tasktracker daemon. This is not a hard and fast rule, however, so in some cases you may need to resort to trial and error, or even to reading the source.

We discuss many of Hadoop’s most important configuration properties throughout this book. You can find a configuration property reference on the book’s website at http://www.hadoopbook.com.

GenericOptionsParser also allows you to set individual properties. For example:
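(Again assuming ConfigurationPrinter is on the Hadoop classpath; note the space between -D and the property:)

% hadoop ConfigurationPrinter -D color=yellow | grep color
color=yellow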

The -D option is used to set the configuration property with key color to the value yellow. Options specified with -D take priority over properties from the configuration files. This is very useful: you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.
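For example, to run a hypothetical Tool-based job driver with ten reducers (MyJobDriver and the input and output paths here are placeholders purely for illustration):

% hadoop MyJobDriver -D mapred.reduce.tasks=10 input output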

The other options that GenericOptionsParser and ToolRunner support are listed in the table below. You can find more on Hadoop’s configuration API in “The Configuration API”.

Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser requires them to be separated by whitespace.

JVM system properties are retrieved from the java.lang.System class, whereas Hadoop properties are accessible only from a Configuration object. So, the following command will print nothing, since the System class is not used by ConfigurationPrinter:

% hadoop ConfigurationPrinter -Dcolor=yellow | grep color

If you want to be able to set configuration through system properties, then you need to mirror the system properties of interest in the configuration file. See “Variable Expansion” for further discussion.

[Table: GenericOptionsParser and ToolRunner options]

