Parallel Copying with distcp - Hadoop

The HDFS access patterns that we have seen so far focus on single-threaded access. It’s possible to act on a collection of files, by specifying file globs, for example, but for efficient, parallel processing of these files you would have to write a program yourself. Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel.

The canonical use case for distcp is for transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:

This will copy the /foo directory (and its contents) from the first cluster to the /bar directory on the second cluster, so the second cluster ends up with the directory structure /bar/foo. If /bar doesn’t exist, it will be created first. You can specify multiple source paths, and all will be copied to the destination. Source paths must be absolute.

By default, distcp will skip files that already exist in the destination, but they can be overwritten by supplying the -overwrite option. You can also update only files that have changed using the -update option.

Using either (or both) of -overwrite or -update changes how the source and destination paths are interpreted. This is best shown with an example.If we changed a file in the /foo subtree on the first cluster from the previous example, then we could synchronize the change with the second cluster by running:

The extra trailing /foo sub directory is needed on the destination, as now the contents of the source directory are copied to the contents of the destination directory. (If you are familiar with rsync, you can think of the -overwrite or -update options as adding an implicit trailing slash to the source.)

If you are unsure of the effect of a distcp operation, it is a good idea to try it out on a small test directory tree first.

There are more options to control the behavior of distcp, including ones to preserve file attributes, ignore failures, and limit the number of files or total data copied. Run it with no options to see the usage instructions.

distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data, by bucketing files into roughly equal allocations.

The number of maps is decided as follows. Since it’s a good idea to get each map to copy a reasonable amount of data to minimize overheads in task setup, each map copies at least 256 MB (unless the total size of the input is less, in which case one map handlesit all). For example, 1 GB of files will be given four map tasks. When the data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization. By default, the maximum number of maps is 20 per (tasktracker) cluster node. For example, copying 1,000 GB of files to a 100-node cluster will allocate 2,000 maps (20 per node), so each will copy 512 MB on average. This can be reduced by specifying the -m argument to distcp. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB on average.

If you try to use distcp between two HDFS clusters that are running different versions, the copy will fail if you use the hdfs protocol, since the RPC systems are incompatible. To remedy this, you can use the read-only HTTP-based HFTP filesystem to read from the source. The job must run on the destination cluster so that the HDFS RPC versions are compatible. To repeat the previous example using HFTP:

Note that you need to specify the namenode’s web port in the source URI. This is determined by the dfs.http.address property, which defaults to 50070.

Keeping an HDFS Cluster Balanced

When copying data into HDFS, it’s important to consider cluster balance. HDFS works best when the file blocks are evenly spread across the cluster, so you want to ensure that distcp doesn’t disrupt this. Going back to the 1,000 GB example, by specifying -m1 a single map would do the copy, which apart from being slow and not using the cluster resources efficiently would mean that the first replica of each block would reside on the node running the map (until the disk filled up). The second and third replicas would be spread across the cluster, but this one node would be unbalanced. By having more maps than nodes in the cluster, this problem is avoided for this reason, it’s best to start by running distcp with the default of 20 maps per node.

However, it’s not always possible to prevent a cluster from becoming unbalanced. Perhaps you want to limit the number of maps so that some of the nodes can be used by other jobs. In this case, you can use the balancer tool to subsequently even out the block distribution across the cluster


All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

Hadoop Topics