Cluster Setup and Installation - Hadoop

Your hardware has arrived. The next steps are to get it racked up and install the software needed to run Hadoop.

There are various ways to install and configure Hadoop. This chapter describes how to do it from scratch using the Apache Hadoop distribution, and gives you the background you need to make the decisions involved in setting up a cluster.

Alternatively, if you would like to use RPMs or Debian packages for managing your Hadoop installation, then you might want to start with Cloudera’s Distribution, described in Appendix B.

To ease the burden of installing and maintaining the same software on each node, it is normal to use an automated installation method like Red Hat Linux’s Kickstart or Debian’s Fully Automatic Installation. These tools allow you to automate the operating system installation by recording the answers to questions that are asked during the installation process (such as the disk partition layout), as well as which packages to install. Crucially, they also provide hooks to run scripts at the end of the process, which are invaluable for doing final system tweaks and customization that is not covered by the standard installer.

The following sections describe the customizations that are needed to run Hadoop. These should all be added to the installation script.

Installing Java

Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred option, although Java distributions from other vendors may work, too. To confirm that Java was installed correctly, run java -version and check that it reports the version you expect.
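The version check can also be automated in the installation script. The following is a minimal sketch, not a definitive implementation: the function names java_major_version and java_is_supported are hypothetical, and the parsing assumes the usual version "..." line that java -version prints.

```shell
# Extract the major Java version from the text printed by `java -version`.
# Handles both the old "1.6.0_45" scheme and the newer "11.0.2" scheme.
java_major_version() {
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$version" in
        1.*) printf '%s\n' "$version" | cut -d. -f2 ;;
        *)   printf '%s\n' "$version" | sed 's/[^0-9].*//' ;;
    esac
}

# Return success if the reported version is Java 6 or later.
java_is_supported() {
    major=$(java_major_version "$1")
    [ -n "$major" ] && [ "$major" -ge 6 ]
}

# Usage in the installation script (java -version writes to stderr):
#   java_is_supported "$(java -version 2>&1)" || echo "Java 6 or later not found" >&2
```

Passing the version text in as an argument keeps the parsing separate from the call to java, so the check can be tried against sample output before it is run on a node.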

Creating a Hadoop User

It’s good practice to create a dedicated Hadoop user account to separate the Hadoop installation from other services running on the same machine.
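A sketch of how the account might be created from the installation script is shown below. It assumes a Linux system with the standard useradd tool and an account named hadoop; the helper names user_exists and ensure_user are hypothetical.

```shell
# Return success if the named account already exists.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Create the account (with a home directory) only if it is missing, so the
# installation script can be re-run safely. Must be run as root.
ensure_user() {
    if user_exists "$1"; then
        echo "user $1 already exists"
    else
        useradd -m "$1"
    fi
}

# Usage in the installation script:
#   ensure_user hadoop
```

If the home directory is to live on NFS, as discussed next, the -m option would probably need to be replaced by whatever home directory arrangement your site uses.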

Some cluster administrators choose to make this user’s home directory an NFS-mounted drive, to aid with SSH key distribution (see the following discussion). The NFS server is typically outside the Hadoop cluster. If you use NFS, it is worth considering autofs, which allows you to mount the NFS filesystem on demand, when the system accesses it. Autofs provides some protection against the NFS server failing and allows you to use replicated filesystems for failover. There are other NFS gotchas to watch out for, such as synchronizing UIDs and GIDs. For help setting up NFS on Linux, refer to the Linux NFS-HOWTO.

Installing Hadoop

Download Hadoop from the Apache Hadoop releases page ( core/releases.html), and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed in the hadoop user’s home directory, as that may be an NFS-mounted directory.
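The unpacking step might look like the following sketch. The function name install_hadoop and the release name hadoop-x.y.z are placeholders, and the code assumes that the archive unpacks into a directory named after the tarball.

```shell
# Unpack a Hadoop release tarball under a prefix and hand it to the given owner.
# Assumes the archive unpacks into a directory named after the tarball,
# for example hadoop-x.y.z.tar.gz -> hadoop-x.y.z.
install_hadoop() {
    tarball=$1
    prefix=$2
    owner=$3
    release=$(basename "$tarball" .tar.gz)
    tar xzf "$tarball" -C "$prefix" || return 1
    chown -R "$owner" "$prefix/$release"
}

# Usage in the installation script (run as root):
#   install_hadoop /tmp/hadoop-x.y.z.tar.gz /usr/local hadoop:hadoop
```

Changing the owner to the hadoop user and group lets the daemons run from a local disk without root privileges.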

Some administrators like to install HDFS and MapReduce in separate locations on the same system. At the time of this writing, only HDFS and MapReduce from the same Hadoop release are compatible with one another; however, in future releases, the compatibility requirements will be loosened. When this happens, having independent installations makes sense, as it gives more upgrade options (for more, see “Upgrades”). For example, it is convenient to be able to upgrade MapReduce (perhaps to patch a bug) while leaving HDFS running.

Note that separate installations of HDFS and MapReduce can still share configuration by using the --config option (when starting daemons) to refer to a common configuration directory. They can also log to the same directory, as the logfiles they produce are named in such a way as to avoid clashes.
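As an illustration, the sketch below prints the start commands for two separate installations that share one configuration directory. The three paths are hypothetical, and the script names start-dfs.sh and start-mapred.sh are assumed to be those shipped in the bin directory of the release in use.

```shell
# Print the start commands for separate HDFS and MapReduce installations
# that share a single configuration directory. All paths are hypothetical.
shared_config_commands() {
    conf_dir=$1
    hdfs_home=$2
    mapred_home=$3
    echo "$hdfs_home/bin/start-dfs.sh --config $conf_dir"
    echo "$mapred_home/bin/start-mapred.sh --config $conf_dir"
}

# Usage: review the commands before running them as the hadoop user.
#   shared_config_commands /etc/hadoop/conf /usr/local/hdfs /usr/local/mapred
```

Printing the commands rather than running them makes it easy to confirm that both installations point at the same directory.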

Testing the Installation

Once you’ve created the installation script, you are ready to test it by installing it on the machines in your cluster. This will probably take a few iterations as you discover kinks in the install. When it’s working, you can proceed to configure Hadoop and give it a test run. This process is documented in the following sections.
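A quick per-node check can shorten those iterations. The following sketch is one possible approach, not part of Hadoop itself: the function name check_node is hypothetical, and it only verifies the three customizations described above (Java, the Hadoop user, and the unpacked distribution).

```shell
# Report which parts of the installation are missing on this node.
# Prints one line per problem and returns non-zero if any were found.
check_node() {
    hadoop_home=$1
    hadoop_user=$2
    problems=0
    command -v java >/dev/null 2>&1 || { echo "java not on PATH"; problems=1; }
    id -u "$hadoop_user" >/dev/null 2>&1 || { echo "user $hadoop_user missing"; problems=1; }
    [ -x "$hadoop_home/bin/hadoop" ] || { echo "$hadoop_home/bin/hadoop not executable"; problems=1; }
    return "$problems"
}

# Usage on a freshly installed node:
#   check_node /usr/local/hadoop-x.y.z hadoop && echo "node looks ready"
```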
