Maintenance - Hadoop

Routine Administration Procedures

Metadata backups

If the namenode’s persistent metadata is lost or damaged, the entire filesystem is rendered unusable, so it is critical that backups are made of these files. You should keep multiple copies of different ages (one hour, one day, one week, and one month, say) to protect against corruption, either in the copies themselves or in the live files running on the namenode.

A straightforward way to make backups is to write a script to periodically archive the secondary namenode’s previous.checkpoint subdirectory (under the directory defined by the fs.checkpoint.dir property) to an offsite location. The script should additionally test the integrity of the copy. This can be done by starting a local namenode daemon and verifying that it has successfully read the fsimage and edits files into memory (by scanning the namenode log for the appropriate success message, for example).
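As an illustration, here is a minimal sketch of the archiving half of such a script. All paths are stand-ins (the example fabricates a previous.checkpoint directory under a temporary location so it can run anywhere); in practice you would point CHECKPOINT_DIR at the real directory under fs.checkpoint.dir and BACKUP_DIR at offsite storage, and the fuller integrity check of starting a local namenode is not shown, only a cheap listing check.

```shell
#!/bin/sh
# Sketch of a periodic metadata backup. The paths below are stand-ins:
# for illustration we create a fake previous.checkpoint directory in a
# temporary location instead of using the secondary namenode's real one.
WORK=$(mktemp -d)
CHECKPOINT_DIR="$WORK/previous.checkpoint"   # stand-in for fs.checkpoint.dir contents
BACKUP_DIR="$WORK/backup"                    # stand-in for an offsite mount
mkdir -p "$CHECKPOINT_DIR" "$BACKUP_DIR"
touch "$CHECKPOINT_DIR/fsimage" "$CHECKPOINT_DIR/edits"   # stand-in metadata files

# Archive the checkpoint directory under a timestamped name, so that
# multiple copies of different ages can be retained side by side.
STAMP=$(date +%Y%m%d-%H%M%S)
ARCHIVE="$BACKUP_DIR/checkpoint-$STAMP.tar.gz"
tar -czf "$ARCHIVE" -C "$(dirname "$CHECKPOINT_DIR")" "$(basename "$CHECKPOINT_DIR")"

# Cheap integrity check: the archive must list both fsimage and edits.
tar -tzf "$ARCHIVE" | grep -q fsimage && tar -tzf "$ARCHIVE" | grep -q edits \
  && echo "backup OK: $ARCHIVE"
```

A production version would run from cron, prune archives older than the retention window, and perform the namenode-based verification described above.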

Data backups

Although HDFS is designed to store data reliably, data loss can occur, just like in any storage system, and thus a backup strategy is essential. With the large data volumes that Hadoop can store, deciding what data to back up and where to store it is a challenge.

The key here is to prioritize your data. The highest priority is data that cannot be regenerated and that is critical to the business; data that is straightforward to regenerate, or essentially disposable because it is of limited business value, is the lowest priority, and you may choose not to make backups of this category of data.

Do not make the mistake of thinking that HDFS replication is a substitute for making backups. Bugs in HDFS can cause replicas to be lost, and so can hardware failures. Although Hadoop is expressly designed so that hardware failure is very unlikely to result in data loss, the possibility can never be completely ruled out, particularly when combined with software bugs or human error.

When it comes to backups, think of HDFS in the same way as you would RAID. Although the data will survive the loss of an individual RAID disk, it may not if the RAID controller fails, or is buggy (perhaps overwriting some data), or the entire array is damaged.

It’s common to have a policy for user directories in HDFS. For example, they may have space quotas and be backed up nightly. Whatever the policy, make sure your users know what it is, so they know what to expect.

The distcp tool is ideal for making backups to other HDFS clusters (preferably running on a different version of the software, to guard against loss due to bugs in HDFS) or other Hadoop filesystems (such as S3 or KFS), since it can copy files in parallel. Alternatively, you can employ an entirely different storage system for backups, using one of the ways to export data from HDFS described in “Hadoop Filesystems”.
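A hedged sketch of a cluster-to-cluster backup with distcp; the hostnames and paths here (namenode1, backup-nn, /user) are hypothetical, not from the original text:

```shell
# Copy the /user tree to a second cluster in parallel.
# namenode1 and backup-nn are placeholder hostnames.
% hadoop distcp hdfs://namenode1/user hdfs://backup-nn/user
```

When the two clusters run different HDFS versions, the read side of the copy can use the version-independent hftp:// scheme instead of hdfs:// so that the RPC version mismatch does not matter.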

Filesystem check (fsck)

It is advisable to run HDFS’s fsck tool regularly (for example, daily) on the whole filesystem to proactively look for missing or corrupt blocks. See “Filesystem check (fsck)”.
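A minimal invocation, checking the entire filesystem from the root (the path argument is the point from which the check starts):

```shell
# Check the whole namespace; add -files -blocks for per-file detail.
% hadoop fsck /
```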

Filesystem balancer

Run the balancer tool (see “balancer” ) regularly to keep the filesystem datanodes evenly balanced.
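For reference, the balancer is typically started with its control script; the threshold argument shown here (the permitted deviation, as a percentage, of each datanode's usage from the cluster average) is optional:

```shell
# Rebalance until every datanode is within 10% of the mean utilization.
% start-balancer.sh -threshold 10
```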

Commissioning and Decommissioning Nodes

As an administrator of a Hadoop cluster, you will need to add or remove nodes from time to time. For example, to grow the storage available to a cluster, you commission new nodes. Conversely, sometimes you may wish to shrink a cluster, and to do so, you decommission nodes. It can sometimes be necessary to decommission a node if it is misbehaving, perhaps because it is failing more often than it should or its performance is noticeably slow.

Nodes normally run both a datanode and a tasktracker, and both are typically commissioned or decommissioned in tandem.

Commissioning new nodes

Although commissioning a new node can be as simple as configuring the hdfs-site.xml file to point to the namenode and the mapred-site.xml file to point to the jobtracker, and starting the datanode and tasktracker daemons, it is generally best to have a list of authorized nodes.

It is a potential security risk to allow any machine to connect to the namenode and act as a datanode, since the machine may gain access to data that it is not authorized to see. Furthermore, since such a machine is not a real datanode, it is not under your control and may stop at any time, causing potential data loss. (Imagine what would happen if a number of such nodes were connected and a block of data were present only on the “alien” nodes.) This scenario is a risk even inside a firewall, through misconfiguration, so datanodes (and tasktrackers) should be explicitly managed on all production clusters.

Datanodes that are permitted to connect to the namenode are specified in a file whose name is specified by the dfs.hosts property. The file resides on the namenode’s local filesystem, and it contains a line for each datanode, specified by network address (as reported by the datanode; you can see what this is by looking at the namenode’s web UI). If you need to specify multiple network addresses for a datanode, put them on one line, separated by whitespace.

Similarly, tasktrackers that may connect to the jobtracker are specified in a file whose name is specified by the mapred.hosts property. In most cases, there is one shared file, referred to as the include file, that both dfs.hosts and mapred.hosts refer to, since nodes in the cluster run both datanode and tasktracker daemons.

The file (or files) specified by the dfs.hosts and mapred.hosts properties is different from the slaves file. The former is used by the namenode and jobtracker to determine which worker nodes may connect. The slaves file is used by the Hadoop control scripts to perform cluster-wide operations, such as cluster restarts. It is never used by the Hadoop daemons.

To add new nodes to the cluster:

  1. Add the network addresses of the new nodes to the include file.
  2. Update the namenode with the new set of permitted datanodes using this command:
     % hadoop dfsadmin -refreshNodes
  3. Update the slaves file with the new nodes, so that they are included in future operations performed by the Hadoop control scripts.
  4. Start the new datanodes.
  5. Restart the MapReduce cluster.
  6. Check that the new datanodes and tasktrackers appear in the web UI.

* At the time of this writing, there is no command to refresh the set of permitted nodes in the jobtracker. Consider setting the mapred.jobtracker.restart.recover property to true to make the jobtracker recover running jobs after a restart.

HDFS will not move blocks from old datanodes to new datanodes to balance the cluster. To do this, you should run the balancer described in “balancer” .

Decommissioning old nodes

Although HDFS is designed to tolerate datanode failures, this does not mean you can just terminate datanodes en masse with no ill effect. With a replication level of three, for example, the chances are very high that you will lose data by simultaneously shutting down three datanodes if they are on different racks. The way to decommission datanodes is to inform the namenode of the nodes that you wish to take out of circulation, so that it can replicate the blocks to other datanodes before the datanodes are shut down.

With tasktrackers, Hadoop is more forgiving. If you shut down a tasktracker that is running tasks, the jobtracker will notice the failure and reschedule the tasks on other tasktrackers.

The decommissioning process is controlled by an exclude file, which for HDFS is set by the dfs.hosts.exclude property and for MapReduce by the mapred.hosts.exclude property. It is often the case that these properties refer to the same file. The exclude file lists the nodes that are not permitted to connect to the cluster.

The rules for whether a tasktracker may connect to the jobtracker are simple: a tasktracker may connect only if it appears in the include file and does not appear in the exclude file. An unspecified or empty include file is taken to mean that all nodes are in the include file.

For HDFS, the rules are slightly different. If a datanode appears in both the include and the exclude file, then it may connect, but only to be decommissioned. The table below summarizes the different combinations for datanodes. As for tasktrackers, an unspecified or empty include file means all nodes are included.


Table: HDFS include and exclude file precedence

  Node in include file?   Node in exclude file?   Interpretation
  No                      No                      Node may not connect
  No                      Yes                     Node may not connect
  Yes                     No                      Node may connect
  Yes                     Yes                     Node may connect and will be decommissioned

To remove nodes from the cluster:

  1. Add the network addresses of the nodes to be decommissioned to the exclude file. Do not update the include file at this point.
  2. Restart the MapReduce cluster to stop the tasktrackers on the nodes being decommissioned.
  3. Update the namenode with the new set of permitted datanodes using this command:
     % hadoop dfsadmin -refreshNodes
  4. Go to the web UI and check whether the admin state has changed to “Decommission In Progress” for the datanodes being decommissioned. They will start copying their blocks to other datanodes in the cluster.
  5. When all the datanodes report their state as “Decommissioned,” then all the blocks have been replicated. Shut down the decommissioned nodes.
  6. Remove the nodes from the include file, and run:
     % hadoop dfsadmin -refreshNodes
  7. Remove the nodes from the slaves file.


Upgrades

Upgrading an HDFS and MapReduce cluster requires careful planning. The most important consideration is the HDFS upgrade. If the layout version of the filesystem has changed, then the upgrade will automatically migrate the filesystem data and metadata to a format that is compatible with the new version. As with any procedure that involves data migration, there is a risk of data loss, so you should be sure that both your data and metadata are backed up (see “Routine Administration Procedures”).

Part of the planning process should include a trial run on a small test cluster with a copy of data that you can afford to lose. A trial run will allow you to familiarize yourself with the process, customize it to your particular cluster configuration and toolset, and iron out any snags before running the upgrade procedure on a production cluster. A test cluster also has the benefit of being available to test client upgrades on.

Version Compatibility

All pre-1.0 Hadoop components have very rigid version compatibility requirements. Only components from the same release are guaranteed to be compatible with each other, which means the whole system from daemons to clients has to be upgraded simultaneously, in lockstep. This necessitates a period of cluster downtime.

Version 1.0 of Hadoop promises to loosen these requirements so that, for example, older clients can talk to newer servers (within the same major release number). In later releases, rolling upgrades may be supported, which would allow cluster daemons to be upgraded in phases, so that the cluster would still be available to clients during the upgrade.

Upgrading a cluster when the filesystem layout has not changed is fairly straightforward: install the new versions of HDFS and MapReduce on the cluster (and on clients at the same time), shut down the old daemons, update configuration files, then start up the new daemons and switch clients to use the new libraries. This process is reversible, so rolling back an upgrade is also straightforward.

After every successful upgrade, you should perform a couple of final cleanup steps:

  • Remove the old installation and configuration files from the cluster.
  • Fix any deprecation warnings in your code and configuration.

HDFS data and metadata upgrades

If you use the procedure just described to upgrade to a new version of HDFS that expects a different layout version, the namenode will refuse to run. A message like the following will appear in its log:

File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.

The most reliable way of finding out whether you need to upgrade the filesystem is by performing a trial on a test cluster.

An upgrade of HDFS makes a copy of the previous version’s metadata and data. Doing an upgrade does not double the storage requirements of the cluster, as the datanodes use hard links to keep two references (for the current and previous version) to the same block of data. This design makes it straightforward to roll back to the previous version of the filesystem, should you need to. You should understand that any changes made to the data on the upgraded system will be lost after the rollback completes.

You can keep only the previous version of the filesystem: you can’t roll back several versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will need to delete the previous version, a process called finalizing the upgrade. Once an upgrade is finalized, there is no procedure for rolling back to a previous version.

In general, you can skip releases when upgrading (for example, you can upgrade from release 0.18.3 to 0.20.0 without having to upgrade to a 0.19.x release first), but in some cases, you may have to go through intermediate releases. The release notes make it clear when this is required.

You should only attempt to upgrade a healthy filesystem. Before running the upgrade, do a full fsck (see “Filesystem check (fsck)”). As an extra precaution, you can keep a copy of the fsck output that lists all the files and blocks in the system, so you can compare it with the output of running fsck after the upgrade.

It’s also worth clearing out temporary files before doing the upgrade, both from the MapReduce system directory on HDFS and local temporary files.

With these preliminaries out of the way, here is the high-level procedure for upgrading a cluster when the filesystem layout needs to be migrated:

  1. Make sure that any previous upgrade is finalized before proceeding with another upgrade.
  2. Shut down MapReduce and kill any orphaned task processes on the tasktrackers.
  3. Shut down HDFS and back up the namenode directories.
  4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on clients.
  5. Start HDFS with the -upgrade option.
  6. Wait until the upgrade is complete.
  7. Perform some sanity checks on HDFS.
  8. Start MapReduce.
  9. Roll back or finalize the upgrade (optional).

While running the upgrade procedure, it is a good idea to remove the Hadoop scripts from your PATH environment variable. This forces you to be explicit about which version of the scripts you are running. It can be convenient to define two environment variables for the old and new installation directories; in the following instructions, we have defined OLD_HADOOP_INSTALL and NEW_HADOOP_INSTALL.

Start the upgrade. To perform the upgrade, run the following command (this is step 5 in the high-level upgrade procedure):
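The command itself was dropped from this copy of the text; it is presumably the HDFS start script of the new installation with the -upgrade option, using the NEW_HADOOP_INSTALL variable defined above:

```shell
% $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade
```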

This causes the namenode to upgrade its metadata, placing the previous version in a new directory called previous.

Similarly, datanodes upgrade their storage directories, preserving the old copy in a directory called previous.

Wait until the upgrade is complete. The upgrade process is not instantaneous, but you can check the progress of an upgrade using dfsadmin (step 6); upgrade events also appear in the daemons’ logfiles:
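The dfsadmin invocation was dropped from this copy of the text; it is presumably the upgrade-progress status query, using the NEW_HADOOP_INSTALL variable defined above:

```shell
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
```

Once the upgrade has finished, this prints output like the lines below.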

Upgrade for version -18 has been completed.

Upgrade is not finalized.

This shows that the upgrade is complete. At this stage, you should run some sanity checks (step 7) on the filesystem (check files and blocks using fsck, basic file operations). You might choose to put HDFS into safe mode while you are running some of these checks (the ones that are read-only) to prevent others from making changes.

Roll back the upgrade (optional). If you find that the new version is not working correctly, you may choose to roll back to the previous version (step 9). This is only possible if you have not finalized the upgrade.

A rollback reverts the filesystem state to before the upgrade was performed, so any changes made in the meantime will be lost. In other words, it rolls back to the previous state of the filesystem, rather than downgrading the current state of the filesystem to a former version.

First, shut down the new daemons:
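The commands were dropped from this copy of the text; presumably they are the stop script of the new installation followed by the old installation’s start script with the -rollback option, using the environment variables defined above:

```shell
# Shut down the new version's daemons.
% $NEW_HADOOP_INSTALL/bin/stop-dfs.sh

# Then start the previous version's daemons with the -rollback option.
% $OLD_HADOOP_INSTALL/bin/start-dfs.sh -rollback
```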

This command gets the namenode and datanodes to replace their current storage directories with their previous copies. The filesystem will be returned to its previous state.

Finalize the upgrade (optional). When you are happy with the new version of HDFS, you can finalize the upgrade (step 9) to remove the previous storage directories. After an upgrade has been finalized, there is no way to roll back to the previous version.

This step is required before performing another upgrade:
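The command was dropped from this copy of the text; presumably it is the dfsadmin finalize operation, using the NEW_HADOOP_INSTALL variable defined above:

```shell
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
```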

HDFS is now fully upgraded to the new version.
