Hadoop in the Cloud

Although many organizations choose to run Hadoop in-house, it is also popular to run Hadoop in the cloud on rented hardware or as a service. For instance, Cloudera offers tools for running Hadoop (see Appendix B) in a public or private cloud, and Amazon has a Hadoop cloud service called Elastic MapReduce.

In this section, we look at running Hadoop on Amazon EC2, which is a great way to try out your own Hadoop cluster on a low-commitment, trial basis.

Hadoop on Amazon EC2

Amazon Elastic Compute Cloud (EC2) is a computing service that allows customers to rent computers (instances) on which they can run their own applications. A customer can launch and terminate instances on demand, paying by the hour for active instances. The Apache Whirr project provides a set of scripts that make it easy to run Hadoop on EC2 and other cloud providers. The scripts allow you to perform such operations as launching or terminating a cluster, or adding instances to an existing cluster.

Running Hadoop on EC2 is especially appropriate for certain workflows. For example, if you store data on Amazon S3, then you can run a cluster on EC2 and run MapReduce jobs that read the S3 data and write output back to S3, before shutting down the cluster.

If you’re working with longer-lived clusters, you might copy S3 data onto HDFS running on EC2 for more efficient processing, as HDFS can take advantage of data locality, but S3 cannot (since S3 storage is not collocated with EC2 nodes).


Before you can run Hadoop on EC2, you need to work through Amazon’s Getting Started Guide (linked from the EC2 website http://aws.amazon.com/ec2/), which goes through setting up an account, installing the EC2 command-line tools, and launching an instance.

Next, install Whirr, then configure the scripts with your Amazon Web Services credentials, security key details, and the type and size of the server instances to use. Detailed instructions for doing this may be found in Whirr’s README file.

Launching a cluster

We are now ready to launch a cluster. To launch a cluster named test-hadoop-cluster with one master node (running the namenode and jobtracker) and five worker nodes (running the datanodes and tasktrackers), type:
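(The command below is a sketch that assumes Whirr’s hadoop-ec2 wrapper script from the Python contrib is on your path; the script name may differ between Whirr releases.)

% hadoop-ec2 launch-cluster test-hadoop-cluster 5   # cluster name, then number of worker nodes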

This will create EC2 security groups for the cluster, if they don’t already exist, and give the master and worker nodes unfettered access to one another. It will also enable SSH access from anywhere. Once the security groups have been set up, the master instance will be launched; then, once it has started, the five worker instances will be launched.

The worker nodes are launched separately so that the master’s hostname can be passed to the worker instances, allowing the datanodes and tasktrackers to connect to the master when they start up.

There are also bash scripts in the src/contrib/ec2 subdirectory of the Hadoop distribution, but these are deprecated in favor of Whirr. In this section, we use Whirr’s Python scripts (found in contrib/python), but note that Whirr also has Java libraries that implement similar functionality.

To use the cluster, network traffic from the client needs to be proxied through the master node of the cluster using an SSH tunnel, which we can set up using the following command:
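(A sketch, again assuming the hadoop-ec2 script. The eval runs the script’s output in the current shell, setting the HADOOP_CLOUD_PROXY_PID environment variable that we use later to stop the proxy.)

% eval `hadoop-ec2 proxy test-hadoop-cluster`   # starts the SSH tunnel and records its pid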

Running a MapReduce job

You can run MapReduce jobs either from within the cluster or from an external machine. Here we show how to run a job from the machine on which we launched the cluster. Note that this requires the same version of Hadoop to be installed locally as is running on the cluster.

When we launched the cluster, a hadoop-site.xml file was created in the directory ~/.hadoop-cloud/test-hadoop-cluster. We can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable as follows:
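% export HADOOP_CONF_DIR=~/.hadoop-cloud/test-hadoop-cluster   # point Hadoop clients at the cluster’s configuration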

The cluster’s filesystem is empty, so before we run a job, we need to populate it with data. Doing a parallel copy from S3 (see “Hadoop Filesystems” for more on the S3 filesystems in Hadoop) using Hadoop’s distcp tool is an efficient way to transfer data into HDFS:
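(The bucket and paths below are placeholders; substitute your own. s3n:// is Hadoop’s native S3 filesystem scheme.)

% hadoop distcp s3n://example-bucket/input input   # parallel copy from S3 into HDFS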

After the data has been copied, we can run a job in the usual way:
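(As an illustration, here is the wordcount program from the examples JAR that ships with Hadoop; the JAR’s exact filename varies between releases.)

% hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output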

Alternatively, we could have specified the job’s input to be the S3 path directly, which would produce the same result without the copy step. When running multiple jobs over the same input data, however, it’s best to copy the data to HDFS first to save bandwidth. Reading directly from S3 would look something like this:
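% hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount s3n://example-bucket/input output   # reads directly from S3 (placeholder bucket)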

You can track the progress of the job using the jobtracker’s web UI, found at http://master_host:50030/. To access web pages running on worker nodes, you need to set up a proxy auto-config (PAC) file in your browser. See the Whirr documentation for details on how to do this.

Terminating a cluster

To shut down the cluster, issue the terminate-cluster command:
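(Again assuming the hadoop-ec2 wrapper script:)

% hadoop-ec2 terminate-cluster test-hadoop-cluster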

You will be asked to confirm that you want to terminate all the instances in the cluster. Finally, stop the proxy process (the HADOOP_CLOUD_PROXY_PID environment variable was set when we started the proxy):
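% kill $HADOOP_CLOUD_PROXY_PID   # stops the SSH proxy started earlier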
