Hadoop Archives

HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)
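To put rough numbers on the difference between disk usage and namenode memory, here is a minimal back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of namenode memory per file or block object; that figure does not come from the text above, and the function names are purely illustrative.

```python
import math

MB = 1024 * 1024
BLOCK_SIZE = 128 * MB
BYTES_PER_OBJECT = 150  # assumption: rough rule of thumb, not an exact figure


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies (at least one)."""
    return max(1, math.ceil(file_size / block_size))


def namenode_bytes(file_sizes, block_size=BLOCK_SIZE):
    """Estimated namenode memory: one object per file plus one per block."""
    objects = sum(1 + blocks_needed(size, block_size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


def disk_bytes(file_sizes):
    """Disk usage is the raw content size, regardless of the block size."""
    return sum(file_sizes)


small_files = [1 * MB] * 1_000_000
print(namenode_bytes(small_files), disk_bytes(small_files))
```

Under this assumption, one million 1 MB files cost on the order of 300 MB of namenode memory, while the disk usage is simply the sum of the file sizes.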

Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.

Using Hadoop Archives

A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it. Here are some files in HDFS that we would like to archive:

The first option to the archive tool, -archiveName, gives the name of the archive, here files.har. HAR files always have a .har extension, which is mandatory for reasons we shall see later. Next come the files to put in the archive. Here we are archiving only one source tree, the files in /my/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file. Let’s see what the archive has created:

The directory listing shows what a HAR file is made of: two index files and a collection of part files (just one in this example). The part files contain the contents of a number of the original files concatenated together, and the indexes make it possible to look up the part file that an archived file is contained in, along with its offset and length. All these details are hidden from the application, however, which uses the har URI scheme to interact with HAR files, using a HAR filesystem that is layered on top of the underlying filesystem (HDFS in this case). The following command recursively lists the files in the archive:
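As a side note on the part files and indexes described above, the idea can be sketched with a toy model: file contents are concatenated into a single part, and an index maps each path to its offset and length. This is not the real HAR on-disk format; the function names and sample paths are hypothetical and only illustrate the lookup.

```python
import io


def build_archive(files):
    """Concatenate file contents into one part and index them by path.

    Returns (part_bytes, index), where index maps path -> (offset, length).
    """
    part = io.BytesIO()
    index = {}
    for path, data in files.items():
        index[path] = (part.tell(), len(data))  # where this file starts
        part.write(data)
    return part.getvalue(), index


def read_file(part, index, path):
    """Recover one archived file using only the index and the part bytes."""
    offset, length = index[path]
    return part[offset:offset + length]


# Hypothetical sample data, for illustration only.
files = {"/my/files/a": b"alpha", "/my/files/dir/b": b"bravo!"}
part, index = build_archive(files)
print(read_file(part, index, "/my/files/dir/b"))
```

The reader never needs to know how the part was assembled; the index alone is enough to slice out one file, which is roughly what the HAR filesystem does on behalf of the application.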

This is quite straightforward if the filesystem that the HAR file is on is the default filesystem. On the other hand, if you want to refer to a HAR file on a different filesystem, then you need to use a different form of the path URI than usual. These two commands have the same effect, for example:

Notice in the second form that the scheme is still har to signify a HAR filesystem, but the authority is hdfs to specify the underlying filesystem’s scheme, followed by a dash and the HDFS host (localhost) and port (8020). We can now see why HAR files have to have a .har extension. The HAR filesystem translates the har URI into a URI for the underlying filesystem by looking at the authority and the path up to and including the component with the .har extension. In this case, it is hdfs://localhost:8020/my/files.har. The remaining part of the path is the path of the file in the archive: /my/files/dir.
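The translation just described can be sketched in a few lines. This is an illustrative approximation, not Hadoop's actual implementation: the function name split_har_uri is hypothetical, and the example URIs are reconstructed from the description above.

```python
from urllib.parse import urlsplit


def split_har_uri(uri):
    """Split a har URI into (archive location, path inside the archive)."""
    parts = urlsplit(uri)
    if parts.scheme != "har":
        raise ValueError("not a har URI: " + uri)

    # Find the first path component that carries the .har extension.
    components = parts.path.split("/")
    for i, component in enumerate(components):
        if component.endswith(".har"):
            break
    else:
        raise ValueError("no .har component in path: " + uri)

    archive_path = "/".join(components[:i + 1])
    inner_path = "/" + "/".join(components[i + 1:])

    if parts.netloc:
        # Authority looks like "<scheme>-<host>:<port>", e.g. hdfs-localhost:8020.
        scheme, _, host_port = parts.netloc.partition("-")
        archive = f"{scheme}://{host_port}{archive_path}"
    else:
        # No authority: the archive lives on the default filesystem.
        archive = archive_path
    return archive, inner_path


print(split_har_uri("har://hdfs-localhost:8020/my/files.har/my/files/dir"))
```

The sketch shows why the extension matters: without a component ending in .har, there is no way to tell where the archive location ends and the path inside the archive begins.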

To delete a HAR file, you need to use the recursive form of delete, since from the underlying filesystem’s point of view the HAR file is a directory:

Limitations

There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of the original files, so you need as much disk space as the files you are archiving to create the archive (although you can delete the originals once you have created the archive). There is currently no support for archive compression, although the files that go into the archive can be compressed (HAR files are like tar files in this respect).

Archives are immutable once they have been created. To add or remove files, you must re-create the archive. In practice, this is not a problem for files that don’t change after being written, since they can be archived in batches on a regular basis, such as daily or weekly.

As noted earlier, HAR files can be used as input to MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, can still be inefficient. “Small files and CombineFileInputFormat” discusses another approach to this problem.
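To see why archiving alone does not help MapReduce, consider a deliberately simplified model in which every file gets at least one split and files are never combined. The function name and the 128 MB split size are assumptions for illustration; real split calculation has more details than this sketch captures.

```python
import math

MB = 1024 * 1024


def count_splits(file_sizes, split_size=128 * MB):
    """Count splits when each file is split on its own, never combined."""
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)


many_small = [1 * MB] * 1000   # 1000 files of 1 MB each
one_large = [1000 * MB]        # the same amount of data in one file

print(count_splits(many_small), count_splits(one_large))
```

In this model, the same volume of data yields a thousand splits as small files but only a handful as one large file, and putting the small files in a HAR does not change the first number.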

