Compression

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

A summary of compression formats

There are many different compression formats, tools, and algorithms, each with different characteristics. The table below lists some of the more common ones that can be used with Hadoop. (At the time of this writing, Hadoop does not support ZIP compression; see https://issues.apache.org/jira/browse/MAPREDUCE-210.)

  Compression format   Tool    Algorithm   Filename extension   Splittable
  DEFLATE              N/A     DEFLATE     .deflate             No
  gzip                 gzip    DEFLATE     .gz                  No
  bzip2                bzip2   bzip2       .bz2                 Yes
  LZO                  lzop    LZO         .lzo                 No

DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. All of the tools listed in the table above give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed and -9 means optimize for space. For example, the following command creates a compressed file file.gz using the fastest compression method:
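% gzip -1 file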

The different tools have very different compression characteristics. Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO, on the other hand, optimizes for speed: it is faster than gzip (or any other compression or decompression tool), but compresses slightly less effectively. (Jeff Gilchrist’s Archive Comparison Test at http://compression.ca/act/act-summary.html contains benchmarks for compression and decompression speed, and compression ratio, for a wide range of tools.)

The “Splittable” column in the table above indicates whether the compression format supports splitting; that is, whether you can seek to any point in the stream and start reading from some point further on. Splittable compression formats are especially suitable for MapReduce; see “Compression and Input Splits” below for further discussion.

Codecs

A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. For example, GzipCodec encapsulates the compression and decompression algorithm for gzip.

Hadoop compression codecs

  Compression format   Hadoop CompressionCodec
  DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
  gzip                 org.apache.hadoop.io.compress.GzipCodec
  bzip2                org.apache.hadoop.io.compress.BZip2Codec
  LZO                  com.hadoop.compression.lzo.LzopCodec

The LZO libraries are GPL-licensed and may not be included in Apache distributions, so the Hadoop codecs must be downloaded separately from http://code.google.com/p/hadoop-gpl-compression/ (or http://github.com/kevinweil/hadoop-lzo, which includes bugfixes and more tools). The LzopCodec is compatible with the lzop tool, which is essentially the LZO format with extra headers, and is the one you normally want. There is also a LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the headers).

Compressing and decompressing streams with CompressionCodec

CompressionCodec has two methods that allow you to easily compress or decompress data. To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream, to which you write your uncompressed data to have it written in compressed form to the underlying stream. Conversely, to decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream.

CompressionOutputStream and CompressionInputStream are similar to java.util.zip.DeflaterOutputStream and java.util.zip.DeflaterInputStream, except that both of the former provide the ability to reset their underlying compressor or decompressor, which is important for applications that compress sections of the data stream as separate blocks, such as SequenceFile, described in “SequenceFile”.

The program below shows how to use the API to compress data read from standard input and write it to standard output.

A program to compress data read from standard input and write it to standard output
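A minimal sketch of such a program, using the StreamCompressor class name referred to in the explanation below:

    // Compress data read from standard input and write it to standard output,
    // using a codec named on the command line.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StreamCompressor {

      public static void main(String[] args) throws Exception {
        // The fully qualified CompressionCodec class name is the first argument
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec)
          ReflectionUtils.newInstance(codecClass, conf);

        // Wrap System.out so that anything copied to 'out' is compressed
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish(); // finish compressing, but leave the underlying stream open
      }
    }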

The application expects the fully qualified name of the CompressionCodec implementation as the first command-line argument. We use ReflectionUtils to construct a new instance of the codec, then obtain a compression wrapper around System.out. Then we call the utility method copyBytes() on IOUtils to copy the input to the output, which is compressed by the CompressionOutputStream. Finally, we call finish() on CompressionOutputStream, which tells the compressor to finish writing to the compressed stream, but doesn’t close the stream. We can try it out with the following command line, which compresses the string “Text” using the StreamCompressor program with the GzipCodec, then decompresses it from standard input using gunzip:
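% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
Text

(This assumes the StreamCompressor class is on Hadoop’s classpath so that the hadoop command can run it.)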

Inferring CompressionCodecs using CompressionCodecFactory

If you are reading a compressed file, you can normally infer the codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on. The extension for each compression format is listed in the table of compression formats above.

CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question. The program below shows an application that uses this feature to decompress files.

A program to decompress a compressed file using a codec inferred from the file’s extension
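A sketch of such a program; the class name FileDecompressor is assumed here and in the invocation shown after the next paragraph:

    // Decompress a file, choosing the codec by its filename extension.
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class FileDecompressor {

      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        // Infer the codec from the filename extension (e.g. .gz maps to GzipCodec)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
          System.err.println("No codec found for " + uri);
          System.exit(1);
        }

        // Strip the compression suffix to form the output filename
        String outputUri =
          CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
          in = codec.createInputStream(fs.open(inputPath));
          out = fs.create(new Path(outputUri));
          IOUtils.copyBytes(in, out, conf);
        } finally {
          IOUtils.closeStream(in);
          IOUtils.closeStream(out);
        }
      }
    }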

Once the codec has been found, it is used to strip off the file suffix to form the output filename (via the removeSuffix() static method of CompressionCodecFactory). In this way, a file named file.gz is decompressed to file by invoking the program as follows:
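% hadoop FileDecompressor file.gz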

CompressionCodecFactory finds codecs from a list defined by the io.compression.codecs configuration property. By default, this lists all the codecs provided by Hadoop, so you would need to alter it only if you have a custom codec that you wish to register (such as the externally hosted LZO codecs). Each codec knows its default filename extension, thus permitting CompressionCodecFactory to search through the registered codecs to find a match for a given extension (if any).
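For example, a custom codec can be registered by including its class name in the comma-separated list held by the property; the class name com.example.MyCustomCodec below is purely hypothetical:

    // Inside application setup code: register a hypothetical custom codec
    // alongside some of the standard Hadoop codecs.
    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec," +
        "org.apache.hadoop.io.compress.GzipCodec," +
        "org.apache.hadoop.io.compress.BZip2Codec," +
        "com.example.MyCustomCodec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);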


Native libraries

For performance, it is preferable to use a native library for compression and decompression. For example, in one test, using the native gzip libraries reduced decompression times by up to 50% and compression times by around 10% (compared to the built-in Java implementation). The availability of Java and native implementations varies by format: not all formats have native implementations (bzip2, for example), whereas others are available only as a native implementation (LZO, for example).


Hadoop comes with prebuilt native compression libraries for 32- and 64-bit Linux, which you can find in the lib/native directory. For other platforms, you will need to compile the libraries yourself, following the instructions on the Hadoop wiki at http://wiki.apache.org/hadoop/NativeHadoop.

The native libraries are picked up using the Java system property java.library.path. The hadoop script in the bin directory sets this property for you, but if you don’t use this script, you will need to set the property in your application.

By default, Hadoop looks for native libraries for the platform it is running on, and loads them automatically if they are found. This means you don’t have to change any configuration settings to use the native libraries. In some circumstances, however, you may wish to disable use of native libraries, such as when you are debugging a compression-related problem. You can achieve this by setting the property hadoop.native.lib to false, which ensures that the built-in Java equivalents will be used (if they are available).
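For example, assuming a Configuration object named conf, the native libraries could be disabled programmatically (the same property can equally be set in a configuration file):

    // Force the built-in Java codec implementations to be used
    conf.setBoolean("hadoop.native.lib", false);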

CodecPool

If you are using a native library and you are doing a lot of compression or decompression in your application, consider using CodecPool, which allows you to reuse compressors and decompressors, thereby amortizing the cost of creating these objects. The program below shows the API, although in this program, which only creates a single Compressor, there is really no need to use a pool.

A program to compress data read from standard input and write it to standard output using a pooled compressor
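A sketch of such a program (the class name PooledStreamCompressor is assumed), which follows the StreamCompressor structure from earlier but obtains its Compressor from the pool:

    // Compress standard input to standard output, borrowing a Compressor
    // from CodecPool and returning it when done.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CodecPool;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.Compressor;
    import org.apache.hadoop.util.ReflectionUtils;

    public class PooledStreamCompressor {

      public static void main(String[] args) throws Exception {
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec)
          ReflectionUtils.newInstance(codecClass, conf);

        Compressor compressor = null;
        try {
          // Borrow a compressor for this codec from the pool
          compressor = CodecPool.getCompressor(codec);
          CompressionOutputStream out =
            codec.createOutputStream(System.out, compressor);
          IOUtils.copyBytes(System.in, out, 4096, false);
          out.finish();
        } finally {
          // Return the compressor to the pool, even if copyBytes() threw an IOException
          CodecPool.returnCompressor(compressor);
        }
      }
    }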

We retrieve a Compressor instance from the pool for a given CompressionCodec, which we use in the codec’s overloaded createOutputStream() method. By using a finally block, we ensure that the compressor is returned to the pool even if there is an IOException while copying the bytes between the streams.

Compression and Input Splits

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.

Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won’t work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.

In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular and so may take longer to run.

If the file in our hypothetical example were an LZO file, we would have the same problem, since the underlying compression format does not provide a way for a reader to synchronize itself with the stream. A bzip2 file, however, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting. (The table of compression formats above lists whether each format supports splitting.)

It is possible to preprocess gzip and LZO files to build an index of split points, effectively making them splittable. See https://issues.apache.org/jira/browse/MAPREDUCE-491 for gzip. For LZO, there is an indexer tool available with the Hadoop LZO libraries, which you can obtain from the sites listed under “Hadoop compression codecs” above.

Which Compression Format Should I Use?

Which compression format you should use depends on your application. Do you want to maximize the speed of your application, or are you more concerned about keeping storage costs down? In general, you should try different strategies for your application and benchmark them with representative datasets to find the best approach.

For large, unbounded files, like logfiles, the options are:

  • Store the files uncompressed.
  • Use a compression format that supports splitting, like bzip2.
  • Split the file into chunks in the application, and compress each chunk separately using any supported compression format (it doesn’t matter whether it is splittable). In this case, you should choose the chunk size so that the compressed chunks are approximately the size of an HDFS block.
  • Use SequenceFile, which supports compression and splitting. See “SequenceFile”.
  • Use an Avro data file, which supports compression and splitting, just like SequenceFile, but has the added advantage of being readable and writable from many languages, not just Java. See “Avro data files”.

For large files, you should not use a compression format that does not support splitting on the whole file, since you lose locality and make MapReduce applications very inefficient.

For archival purposes, consider the Hadoop archive format (see “Hadoop Archives”), although it does not support compression.

Using Compression in MapReduce

As described in “Inferring CompressionCodecs using CompressionCodecFactory”, if your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use.

To compress the output of a MapReduce job, in the job configuration set the mapred.output.compress property to true and the mapred.output.compression.codec property to the classname of the compression codec you want to use, as shown in the program below.
Application to run the maximum temperature job producing compressed output
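A sketch of such a driver, using the older org.apache.hadoop.mapred API to match the property names above; the two FileOutputFormat convenience methods set those properties. MaxTemperatureMapper and MaxTemperatureReducer are assumed to be the mapper and reducer classes from the uncompressed version of the job and are not shown here:

    // Run the maximum temperature job, compressing its output with gzip.
    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MaxTemperatureWithCompression {

      public static void main(String[] args) throws IOException {
        if (args.length != 2) {
          System.err.println(
              "Usage: MaxTemperatureWithCompression <input path> <output path>");
          System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperatureWithCompression.class);
        conf.setJobName("Max temperature with compressed output");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Equivalent to setting mapred.output.compress=true and
        // mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // Mapper and reducer from the uncompressed version of the job (not shown)
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setCombinerClass(MaxTemperatureReducer.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        JobClient.runJob(conf);
      }
    }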

We run the program over compressed input (which doesn’t have to use the same compression format as the output, although it does in this example) as follows:

% hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output

Each part of the final output is compressed; in this case, the job produces a single compressed part file (for example, part-00000.gz).


If you are emitting sequence files for your output, then you can set the mapred.output.compression.type property to control the type of compression to use. The default is RECORD, which compresses individual records. Changing this to BLOCK, which compresses groups of records, is recommended since it compresses better.
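For example, assuming the job’s configuration object is named conf:

    // Compress groups of records together in the sequence file output
    conf.set("mapred.output.compression.type", "BLOCK");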
Compressing map output

Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, using a fast compressor such as LZO can give performance gains simply because the volume of data to transfer is reduced. The configuration properties to enable compression for map outputs and to set the compression codec are mapred.compress.map.output (set it to true) and mapred.map.output.compression.codec (the codec class to use).


Here are the lines to add to enable gzip map output compression in your job:
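A sketch, assuming the old-API JobConf named conf used in the driver above (the two convenience methods set the properties listed earlier):

    conf.setCompressMapOutput(true);                    // mapred.compress.map.output
    conf.setMapOutputCompressorClass(GzipCodec.class);  // mapred.map.output.compression.codec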

