Running Locally on Test Data

Now that we’ve got the mapper and reducer working on controlled inputs, the next step is to write a job driver and run it on some test data on a development machine.

Running a Job in a Local Job Runner

Using the Tool interface introduced earlier in the chapter, it’s easy to write a driver to run our MapReduce job for finding the maximum temperature by year (see MaxTemperatureDriver below).

MaxTemperatureDriver implements the Tool interface, so we get the benefit of being able to set the options that GenericOptionsParser supports. The run() method constructs and configures a JobConf object before launching a job described by the JobConf. Among the possible job configuration parameters, we set the input and output file paths; the mapper, reducer, and combiner classes; and the output types (the input types are determined by the input format, which defaults to TextInputFormat and has LongWritable keys and Text values). It’s also a good idea to set a name for the job so that you can pick it out in the job list during execution and after it has completed. By default, the name is the name of the JAR file, which is normally not particularly descriptive.
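For reference, here is a minimal sketch of what such a driver can look like using the older org.apache.hadoop.mapred API; it assumes mapper and reducer classes named MaxTemperatureMapper and MaxTemperatureReducer from earlier in the chapter, with the reducer doubling as the combiner:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    // Build on any configuration injected via Tool/GenericOptionsParser
    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Max temperature"); // descriptive name for the job list

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Output types; input types come from the default TextInputFormat
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}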

Now we can run this application against some local files. Hadoop comes with a local job runner, a cut-down version of the MapReduce execution engine for running MapReduce jobs in a single JVM. It’s designed for testing and is very convenient for use in an IDE, since you can run it in a debugger to step through the code in your mapper and reducer.

The local job runner is only designed for simple testing of MapReduce programs, so inevitably it differs from the full MapReduce implementation. The biggest difference is that it can’t run more than one reducer (it can support the zero-reducer case, too). This is normally not a problem, as most applications can work with one reducer, although on a cluster you would choose a larger number to take advantage of parallelism. The thing to watch out for is that even if you set the number of reducers to a value over one, the local runner will silently ignore the setting and use a single reducer.

The local job runner also has no support for the DistributedCache feature (described in “Distributed Cache”).

Neither of these limitations is inherent in the local job runner, and future versions of Hadoop may relax these restrictions.

The local job runner is enabled by a configuration setting. Normally, mapred.job.tracker is a host:port pair specifying the address of the jobtracker, but when it has the special value of local, the job is run in-process without accessing an external jobtracker.

From the command line, we can run the driver by typing:
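The invocation might look something like the following, where conf/hadoop-local.xml is assumed to be a configuration file that sets mapred.job.tracker to local (and fs.default.name to file:///); the configuration file name and the driver’s packaging are placeholders, so adjust them to your project:

% hadoop MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp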

Equivalently, we could use the -fs and -jt options provided by GenericOptionsParser:
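For example (again treating the driver’s packaging as a placeholder):

% hadoop MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro max-temp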

This command executes MaxTemperatureDriver using input from the local input/ncdc/micro directory, producing output in the local max-temp directory. Note that although we’ve set -fs so we use the local filesystem (file:///), the local job runner will actually work fine against any filesystem, including HDFS (and it can be handy to do this if you have a few files that are on HDFS).

When we run the program, it fails, printing an exception thrown while the mapper is parsing a temperature value.

Fixing the mapper

This exception shows that the map method still can’t parse positive temperatures. (If the stack trace hadn’t given us enough information to diagnose the fault, we could run the test in a local debugger, since it runs in a single JVM.) Earlier, we made it handle the special case of a missing temperature, +9999, but not the general case of any positive temperature. With more logic going into the mapper, it makes sense to factor out a parser class to encapsulate the parsing logic (bringing the mapper to its third version).
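The following sketch shows the shape of such a parser class. The class name, the fixed-width column offsets, and the set of acceptable quality codes here are assumptions for illustration and should be checked against the actual NCDC record format and the accompanying example code:

import org.apache.hadoop.io.Text;

// Sketch of a parser for NCDC weather records. Column offsets and quality
// codes below are illustrative assumptions about the record format.
public class NcdcRecordParser {

  private static final int MISSING_TEMPERATURE = 9999;

  private String year;
  private int airTemperature;
  private String quality;

  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Strip a leading plus sign, since Integer.parseInt on older Java
    // versions rejects it -- this was the cause of the earlier exception.
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }

  public void parse(Text record) {
    parse(record.toString());
  }

  public boolean isValidTemperature() {
    // Reject missing readings and readings with a poor quality code
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }

  public String getYear() {
    return year;
  }

  public int getAirTemperature() {
    return airTemperature;
  }
}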

The resulting mapper is much simpler. It just calls the parser’s parse() method, which parses the fields of interest from a line of input; checks whether a valid temperature was found using the isValidTemperature() query method; and, if it was, retrieves the year and the temperature using the getter methods on the parser. Notice that isValidTemperature() also checks the quality status field, as well as screening out missing temperatures, to filter out poor temperature readings.
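Using the parser sketched above, the simplified mapper might look something like this (old API, as with the driver sketch earlier):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      // Emit (year, temperature); the reducer picks the maximum per year
      output.collect(new Text(parser.getYear()),
          new IntWritable(parser.getAirTemperature()));
    }
  }
}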

Another benefit of creating a parser class is that it makes it easy to write related mappers for similar jobs without duplicating code. It also gives us the opportunity to write unit tests directly against the parser, for more targeted testing.

With these changes, the test passes.

Testing the Driver

Apart from the flexible configuration options it offers, making your application implement Tool also makes it more testable, because it allows you to inject an arbitrary Configuration. You can take advantage of this to write a test that uses a local job runner to run a job against known input data, and checks that the output is as expected.

There are two approaches to doing this. The first is to use the local job runner and run the job against a test file on the local filesystem. The code below gives an idea of how to do this.

The test explicitly sets fs.default.name and mapred.job.tracker so it uses the local filesystem and the local job runner. It then runs the MaxTemperatureDriver via its Tool interface against a small amount of known data. At the end of the test, the checkOutput() method is called to compare the actual output with the expected output, line by line.
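A sketch of such a test is shown below, assuming JUnit 4. The paths and the checkOutput() helper (left as a stub here) are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class MaxTemperatureDriverTest {

  @Test
  public void testOnLocalData() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "file:///");   // local filesystem
    conf.set("mapred.job.tracker", "local");   // local job runner

    Path input = new Path("input/ncdc/micro");
    Path output = new Path("output");

    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true);                   // remove output from earlier runs

    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);

    int exitCode = driver.run(
        new String[] { input.toString(), output.toString() });
    assertEquals(0, exitCode);

    checkOutput(conf, output);
  }

  private void checkOutput(Configuration conf, Path output) throws Exception {
    // Compare each line of the actual output under 'output' with a file of
    // expected results (implementation omitted).
  }
}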

The second way of testing the driver is to run it using a “mini-” cluster. Hadoop has a pair of testing classes, called MiniDFSCluster and MiniMRCluster, which provide a programmatic way of creating in-process clusters. Unlike the local job runner, these allow testing against the full HDFS and MapReduce machinery. Bear in mind, too, that tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can make debugging more difficult.

Mini-clusters are used extensively in Hadoop’s own automated test suite, but they can be used for testing user code, too. Hadoop’s ClusterMapReduceTestCase abstract class provides a useful base for writing such a test: it handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods, and generates a suitable JobConf object that is configured to work with them. Subclasses need only populate data in HDFS (perhaps by copying from a local file), run a MapReduce job, and confirm that the output is as expected. Refer to the MaxTemperatureDriverMiniTest class in the example code that comes with this book for the listing.
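A rough skeleton of such a test might look like the following. It assumes the JUnit 3-style base class described above, with createJobConf() and getFileSystem() as the hooks it provides for obtaining a configuration and a filesystem wired to the in-process clusters; the test data path is a placeholder, and the exact API should be checked against the Hadoop version in use:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.ClusterMapReduceTestCase;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureDriverMiniTest extends ClusterMapReduceTestCase {

  public void test() throws Exception {
    JobConf conf = createJobConf();   // configured for the mini DFS/MR clusters

    Path input = getFileSystem().makeQualified(new Path("input"));
    Path output = getFileSystem().makeQualified(new Path("output"));

    // Copy a small local sample file into the in-process HDFS
    // (the local path here is a placeholder)
    getFileSystem().copyFromLocalFile(
        new Path("input/ncdc/micro/sample.txt"), input);

    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);
    int exitCode = driver.run(
        new String[] { input.toString(), output.toString() });
    assertEquals(0, exitCode);

    // Read the job output from the mini HDFS and compare it with the
    // expected results (helper omitted).
  }
}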

Tests like this serve as regression tests, and are a useful repository of input edge cases and their expected results. As you encounter more test cases, you can simply add them to the input file and update the file of expected output accordingly.

