Tuning a Job - Hadoop

After a job is working, the question many developers ask is, “Can I make it run faster?” There are a few Hadoop-specific “usual suspects” that are worth checking to see whether they are responsible for a performance problem. You should run through the checklist in the table below before you start trying to profile or optimize at the task level.

Table: Tuning a Job checklist

Profiling Tasks

Like debugging, profiling a job running on a distributed system like MapReduce presents some challenges. Hadoop allows you to profile a fraction of the tasks in a job, and, as each task completes, pulls down the profile information to your machine for later analysis with standard profiling tools.

Of course, it’s possible, and somewhat easier, to profile a job running in the local job runner. Provided you can run with enough input data to exercise the map and reduce tasks, this can be a valuable way of improving the performance of your mappers and reducers. There are a couple of caveats, however. The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should compare the new execution time with the old one when running on a real cluster. Even this is easier said than done, since job execution times can vary due to resource contention with other jobs and the scheduler’s task placement decisions. To get a good idea of job execution time under these circumstances, perform a series of runs (with and without the change) and check whether any improvement is statistically significant.
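One way to check whether a difference between two series of runs is more than noise is Welch's t-test. The sketch below is only an illustration: the class name and the timings in main() are hypothetical, and it computes the t statistic and the approximate degrees of freedom, leaving the lookup of a p-value to a statistics table or library.

```java
import java.util.Arrays;

public class RunTimeComparison {

    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return sumSq / (xs.length - 1);
    }

    // Welch's t statistic for two independent samples with unequal variances.
    static double welchT(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    // Welch-Satterthwaite approximation of the degrees of freedom.
    static double welchDf(double[] a, double[] b) {
        double va = variance(a) / a.length;
        double vb = variance(b) / b.length;
        return (va + vb) * (va + vb)
            / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }

    public static void main(String[] args) {
        // Hypothetical job execution times in seconds, five runs per variant.
        double[] before = {612, 598, 640, 605, 621};
        double[] after = {601, 615, 590, 633, 608};
        System.out.printf("t = %.3f, df = %.1f%n",
            welchT(before, after), welchDf(before, after));
    }
}
```

A small absolute value of t relative to the critical value for the computed degrees of freedom suggests that the change made no measurable difference on this cluster.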

It’s unfortunately true that some problems (such as excessive memory use) can be reproduced only on the cluster, and in these cases the ability to profile in situ is indispensable.

The HPROF profiler

There are a number of configuration properties to control profiling, which are also exposed via convenience methods on JobConf. The following modification to MaxTemperatureDriver (version 6) will enable remote HPROF profiling. HPROF is a profiling tool that comes with the JDK that, although basic, can give valuable information about a program’s CPU and heap usage:§

The first line enables profiling, which by default is turned off. (This is equivalent to setting the configuration property mapred.task.profile to true.)

Next we set the profile parameters, which are the extra command-line arguments to pass to the task’s JVM. (When profiling is enabled, a new JVM is allocated for each task, even if JVM reuse is turned on; see “Task JVM Reuse”.) The default parameters specify the HPROF profiler; here we set an extra HPROF option, depth=6, to give more stack trace depth than the HPROF default. The setProfileParams() method on JobConf is equivalent to setting the mapred.task.profile.params property.
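The two settings described so far can also be expressed as plain configuration properties. The following sketch collects them in a java.util.Properties object so that it runs without Hadoop on the classpath; the class and method names are hypothetical, and the HPROF option string is an assumption based on Hadoop's documented default with depth=6 added, so check it against your Hadoop version. The %s placeholder is where Hadoop is expected to substitute the name of the profile output file.

```java
import java.util.Properties;

public class ProfilingProperties {

    // Builds the JVM argument that loads the HPROF agent. The option list is
    // assumed to match Hadoop's default parameters plus the extra depth option.
    static String hprofParams(int depth) {
        return "-agentlib:hprof=cpu=samples,heap=sites,depth=" + depth
            + ",force=n,thread=y,verbose=n,file=%s";
    }

    // Collects the property names mentioned in the text with their values.
    static Properties profilingProperties(int depth) {
        Properties props = new Properties();
        props.setProperty("mapred.task.profile", "true");
        props.setProperty("mapred.task.profile.params", hprofParams(depth));
        return props;
    }

    public static void main(String[] args) {
        // Print each property as name=value, as it might appear in a job configuration.
        profilingProperties(6).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a real driver the same values would be set on the JobConf, either through its convenience methods or by setting the properties directly.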

Finally, we specify which tasks we want to profile. We normally only want profile information from a few tasks, so we use the setProfileTaskRange() method to specify the range of task IDs that we want profile information for. We’ve set it to 0-2 (which is actually the default), which means tasks with IDs 0, 1, and 2 are profiled. The first argument to the setProfileTaskRange() method dictates whether the range is for map or reduce tasks: true is for maps, false is for reduces. A set of ranges is permitted, using a notation that allows open ranges. For example, 0-1,4,6- would specify all tasks except those with IDs 2, 3, and 5. The tasks to profile can also be controlled using the mapred.task.profile.maps property for map tasks, and mapred.task.profile.reduces for reduce tasks.
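To make the range notation concrete, here is a minimal sketch of a parser for strings such as 0-1,4,6-. It is not Hadoop's own implementation; the TaskRanges class is hypothetical, and treating a missing lower bound as 0 is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskRanges {

    // Each entry is a closed interval {start, end}; an open upper bound
    // is represented by Integer.MAX_VALUE.
    private final List<int[]> ranges = new ArrayList<>();

    TaskRanges(String spec) {
        for (String part : spec.split(",")) {
            String p = part.trim();
            if (p.isEmpty()) {
                continue; // ignore empty entries
            }
            int dash = p.indexOf('-');
            if (dash < 0) {
                // A single task ID such as "4".
                int id = Integer.parseInt(p);
                ranges.add(new int[] {id, id});
            } else {
                // A range such as "0-1", or an open range such as "6-".
                String lo = p.substring(0, dash);
                String hi = p.substring(dash + 1);
                int start = lo.isEmpty() ? 0 : Integer.parseInt(lo);
                int end = hi.isEmpty() ? Integer.MAX_VALUE : Integer.parseInt(hi);
                ranges.add(new int[] {start, end});
            }
        }
    }

    // Returns true if the task ID falls inside any of the parsed intervals.
    boolean isIncluded(int taskId) {
        for (int[] r : ranges) {
            if (taskId >= r[0] && taskId <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TaskRanges r = new TaskRanges("0-1,4,6-");
        for (int id = 0; id <= 7; id++) {
            System.out.println(id + " -> " + r.isIncluded(id));
        }
    }
}
```

With the specification 0-1,4,6- the sketch reports IDs 2, 3, and 5 as excluded and every other ID as included, which matches the description above.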

When we run a job with the modified driver, the profile output turns up at the end of the job in the directory we launched the job from. Since we are only profiling a few tasks, we can run the job on a subset of the dataset.

§ HPROF uses byte code insertion to profile your code, so you do not need to recompile your application with special options to use it. For more information on HPROF, see “HPROF: A Heap/CPU Profiling Tool in J2SE 5.0,” by Kelly O’Hair at http://java.sun.com/developer/technicalArticles/Programming/HPROF.html.

Here’s a snippet of one of the mapper’s profile files, which shows the CPU sampling information:

Cross-referencing the trace number 307973 gives us the stack trace from the same file:

So it looks like the mapper is spending 3% of its time constructing IntWritable objects. This observation suggests that it might be worth reusing the Writable instances being output.
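The reuse idea can be sketched without Hadoop on the classpath. In the example below, IntBox is a hypothetical stand-in for IntWritable (a mutable holder with set() and get()), and Collector is a hypothetical stand-in for an output collector that copies the value as soon as it is handed over, which is the assumption that makes overwriting the holder safe.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {

    // Stand-in for IntWritable: a mutable int holder.
    static final class IntBox {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    // Stand-in for an output collector that copies the value immediately,
    // so the caller may overwrite the holder afterwards.
    static final class Collector {
        final List<Integer> written = new ArrayList<>();
        void collect(IntBox v) { written.add(v.get()); }
    }

    // Allocates a new holder for every record.
    static void emitAllocating(int[] temps, Collector out) {
        for (int t : temps) {
            IntBox box = new IntBox();
            box.set(t);
            out.collect(box);
        }
    }

    // Allocates one holder up front and overwrites it for every record.
    static void emitReusing(int[] temps, Collector out) {
        IntBox box = new IntBox();
        for (int t : temps) {
            box.set(t);
            out.collect(box);
        }
    }

    public static void main(String[] args) {
        int[] temps = {22, -11, 317}; // hypothetical temperature readings
        Collector a = new Collector();
        Collector b = new Collector();
        emitAllocating(temps, a);
        emitReusing(temps, b);
        // Both variants should produce the same output records.
        System.out.println(a.written.equals(b.written));
    }
}
```

Both variants emit the same records; the second simply avoids one allocation per record. If a collector kept a reference to the holder instead of copying its value, reuse would corrupt earlier records.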

However, we know if this is significant only if we can measure an improvement when running the job over the whole dataset. Running each variant five times on an otherwise quiet 11-node cluster showed no statistically significant difference in job execution time. Of course, this result holds only for this particular combination of code, data, and hardware, so you should perform similar benchmarks to see whether such a change is significant for your setup.

Other profilers

At the time of this writing, the mechanism for retrieving profile output is HPROF-specific. Until this is fixed, it should be possible to use Hadoop’s profiling settings to trigger profiling using any profiler (see the documentation for the particular profiler), although it may be necessary to manually retrieve the profiler’s output from tasktrackers for analysis.

If the profiler is not installed on all the tasktracker machines, consider using the Distributed Cache (see “Distributed Cache”) to make the profiler binary available on the required machines.
