Hadoop Pipes

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

We’ll rewrite the example running through the chapter in C++, and then we’ll see how to run it using Pipes. The example below shows the source code for the map and reduce functions in C++.

Example. Maximum temperature in C++

The application links against the Hadoop C++ library, which is a thin wrapper for communicating with the tasktracker child process. The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case. These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as accessing job configuration information via the JobConf class. The processing in this example is very similar to the Java equivalent.
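As a rough illustration of the shape described above, the following sketch defines a mapper by extending a base class and implementing map() against a context object. It does not use the real Hadoop C++ library: everything in the `sketch` namespace is a hypothetical stand-in that only mirrors the names mentioned in the text (the real Mapper is constructed with a task context and offers more methods). The fixed-width field offsets and the quality codes are also assumptions about the weather record layout, not values taken from this page.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Pipes interfaces (NOT the real library headers).
namespace sketch {

class MapContext {
public:
  virtual ~MapContext() {}
  virtual const std::string& getInputValue() const = 0;
  virtual void emit(const std::string& key, const std::string& value) = 0;
};

class Mapper {
public:
  virtual ~Mapper() {}
  virtual void map(MapContext& context) = 0;
};

// Test double: holds one input record and records every emitted pair.
class RecordingMapContext : public MapContext {
public:
  explicit RecordingMapContext(const std::string& value) : value_(value) {}
  const std::string& getInputValue() const override { return value_; }
  void emit(const std::string& key, const std::string& value) override {
    assert(!key.empty());  // an empty key would indicate a bug in the mapper
    output.push_back(std::make_pair(key, value));
  }
  std::vector<std::pair<std::string, std::string> > output;

private:
  std::string value_;
};

}  // namespace sketch

// Emits (year, temperature) for readings that are present and of good quality.
class MaxTemperatureMapper : public sketch::Mapper {
public:
  void map(sketch::MapContext& context) override {
    const std::string& line = context.getInputValue();
    if (line.size() < 93) return;  // too short to contain all three fields
    // Assumed fixed-width layout: year at 15..18, temperature at 87..91,
    // one-character quality code at 92.
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string quality = line.substr(92, 1);
    bool goodQuality = std::string("01459").find(quality) != std::string::npos;
    if (airTemperature != "+9999" && goodQuality) {
      context.emit(year, airTemperature);  // the value stays a string
    }
  }
};
```

To try it, build a 93-character record, wrap it in a RecordingMapContext, call map(), and inspect the context's output vector. A real Pipes mapper would receive its context from the library rather than from a test double like this.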

Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater burden on the application developer, who has to convert to and from richer domain-level types. This is evident in MapTemperatureReducer, where we have to convert the input value into an integer (using a convenience method in HadoopUtils) and then the maximum value back into a string before it’s written out. In some cases, we can save on doing the conversion, such as in MaxTemperatureMapper, where the airTemperature value is never converted to an integer since it is never processed as a number in the map() method.
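The string-in, string-out conversion can be sketched in isolation. The code below is a minimal illustration, not the library's code: the `sketch::toInt` and `sketch::toString` helpers are hypothetical stand-ins built on the standard library, on the assumption that the HadoopUtils conversion helpers behave similarly, and the reducer's input values are modelled as a plain vector rather than being read from a reduce context.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <string>
#include <vector>

namespace sketch {

// Stand-ins for the conversion helpers (assumption: similar behaviour).
int toInt(const std::string& text) { return std::stoi(text); }
std::string toString(int value) { return std::to_string(value); }

}  // namespace sketch

// Reduce step for one key: every value arrives as a string, is converted to
// an integer for comparison, and the maximum is converted back to a string.
std::string maxTemperature(const std::vector<std::string>& values) {
  assert(!values.empty());  // a reducer is only called for keys with values
  int maxValue = INT_MIN;
  for (const std::string& value : values) {
    maxValue = std::max(maxValue, sketch::toInt(value));
  }
  return sketch::toString(maxValue);
}
```

Note that the round trip is not lossless in form: a signed, zero-padded input such as "+0111" comes back as "111", because only the numeric value survives the conversion.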

The main() method is the application entry point. It calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer. Which one it creates is controlled by the Java parent over the socket connection. There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.
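The factory idea can be sketched without the Hadoop library. In the illustration below, all names in the `sketch` namespace are hypothetical, and `runTaskSketch` only imitates the decision the text describes: the real runTask() learns over the socket which kind of task to create, whereas here the role is passed in as an argument. The name() method exists only so the demo can report which class was instantiated.

```cpp
#include <cassert>
#include <memory>
#include <string>

namespace sketch {

// Common base so the demo can report which kind of task was created.
class Task {
public:
  virtual ~Task() {}
  virtual std::string name() const = 0;
};
class Mapper : public Task {};
class Reducer : public Task {};

// The runner depends only on this interface, not on concrete classes.
class Factory {
public:
  virtual ~Factory() {}
  virtual std::unique_ptr<Mapper> createMapper() const = 0;
  virtual std::unique_ptr<Reducer> createReducer() const = 0;
};

// Template factory: binds concrete mapper and reducer types at compile time.
template <typename M, typename R>
class TemplateFactory : public Factory {
public:
  std::unique_ptr<Mapper> createMapper() const override {
    return std::unique_ptr<Mapper>(new M());
  }
  std::unique_ptr<Reducer> createReducer() const override {
    return std::unique_ptr<Reducer>(new R());
  }
};

enum class Role { Map, Reduce };

// Imitates the runner: in Pipes the role would be dictated by the Java
// parent over the socket; here it is simply a parameter.
std::string runTaskSketch(const Factory& factory, Role role) {
  if (role == Role::Map) {
    std::unique_ptr<Mapper> mapper = factory.createMapper();
    assert(mapper != nullptr);
    return mapper->name();
  }
  std::unique_ptr<Reducer> reducer = factory.createReducer();
  assert(reducer != nullptr);
  return reducer->name();
}

}  // namespace sketch

class MaxTemperatureMapper : public sketch::Mapper {
public:
  std::string name() const override { return "MaxTemperatureMapper"; }
};

class MapTemperatureReducer : public sketch::Reducer {
public:
  std::string name() const override { return "MapTemperatureReducer"; }
};
```

Calling runTaskSketch with a TemplateFactory over the two classes and Role::Map instantiates the mapper, while Role::Reduce instantiates the reducer. The same executable can therefore serve both kinds of task.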

Compiling and Running

Now we can compile and link our program using the Makefile in the example below.

Example. Makefile for C++ MapReduce program

The Makefile expects a couple of environment variables to be set. Apart from HADOOP_INSTALL (which you should already have set if you followed the installation instructions in Appendix A), you need to define PLATFORM, which specifies the operating system, architecture, and data model (e.g., 32- or 64-bit). I ran it on a 32-bit Linux system with the following:

On successful completion, you’ll find the max_temperature executable in the current directory.

To run a Pipes job, we need to run Hadoop in pseudo-distributed mode (where all the daemons run on the local machine), for which there are setup instructions in Appendix A. Pipes doesn’t run in standalone (local) mode, since it relies on Hadoop’s distributed cache mechanism, which works only when HDFS is running.

With the Hadoop daemons now running, the first step is to copy the executable to HDFS so that it can be picked up by tasktrackers when they launch map and reduce tasks:

The sample data also needs to be copied from the local filesystem into HDFS:

Now we can run the job. For this, we use the hadoop pipes command, passing the URI of the executable in HDFS using the -program argument:

We specify two properties using the -D option: hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified a C++ record reader or writer, but that we want to use the default Java ones (which are for text input and output). Pipes also allows you to set a Java mapper, reducer, combiner, or partitioner. In fact, you can have a mixture of Java or C++ classes within any one job.

The result is the same as the other versions of the same program that we ran.
