HCatalog Input Output Format

What are HCatalog Input Output Format Interfaces?

The HCatInputFormat and HCatOutputFormat interfaces are used to read data from HDFS and, after processing, write the resultant data back into HDFS using a MapReduce job. Let us discuss these input and output format interfaces.

HCatInputFormat

The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.

1. public static HCatInputFormat setInput(Job job, String dbName, String tableName) throws IOException
   Set inputs to use for the job. It queries the metastore with the given input specification and serializes matching partitions into the job configuration for MapReduce tasks.

2. public static HCatInputFormat setInput(Configuration conf, String dbName, String tableName) throws IOException
   Set inputs to use for the job. It queries the metastore with the given input specification and serializes matching partitions into the job configuration for MapReduce tasks.

3. public HCatInputFormat setFilter(String filter) throws IOException
   Set a filter on the input table.

4. public HCatInputFormat setProperties(Properties properties) throws IOException
   Set properties for the input format.

The HCatInputFormat API includes the following methods −
  • setInput
  • setOutputSchema
  • getTableSchema
To read data with HCatInputFormat, first instantiate an InputJobInfo with the necessary information about the table being read, and then call setInput with the InputJobInfo.
You can use the setOutputSchema method to include a projection schema that specifies the output fields. If no schema is specified, all the columns in the table are returned. You can use the getTableSchema method to determine the table schema for a specified input table.
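As a quick illustration, the following minimal sketch wires up the read side of a job. The class name ReadSetup and the table name mytable are made-up placeholders, and the InputJobInfo-based setInput overload described above is used; exact signatures vary slightly between HCatalog releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class ReadSetup {
   // Configures a job to read the placeholder table "mytable" from the
   // default database (null database name) with no partition filter (null).
   public static Job configureRead(Configuration conf) throws Exception {
      Job job = new Job(conf, "hcat-read");
      HCatInputFormat.setInput(job, InputJobInfo.create(null, "mytable", null));
      job.setInputFormatClass(HCatInputFormat.class);

      // getTableSchema describes the input table; passing a projection
      // schema (a subset of these fields) to setOutputSchema limits which
      // columns reach the mappers. Passing the full schema keeps all columns.
      HCatSchema tableSchema = HCatInputFormat.getTableSchema(job);
      HCatInputFormat.setOutputSchema(job, tableSchema);
      return job;
   }
}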

HCatOutputFormat

HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables. It exposes a Hadoop 0.20 MapReduce API for writing data to a table. When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used, and the new partition is published to the table after the job completes.

1. public static void setOutput(Configuration conf, Credentials credentials, OutputJobInfo outputJobInfo) throws IOException
   Set the information about the output to write for the job. It queries the metadata server to find the StorageHandler to use for the table. It throws an error if the partition is already published.

2. public static void setSchema(Configuration conf, HCatSchema schema) throws IOException
   Set the schema for the data being written out to the partition. The table schema is used by default for the partition if this is not called.

3. public RecordWriter<WritableComparable<?>, HCatRecord> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException
   Get the record writer for the job. It uses the StorageHandler's default OutputFormat to get the record writer.

4. public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException
   Get the output committer for this output format. It ensures that the output is committed correctly.

The HCatOutputFormat API includes the following methods −
  • setOutput
  • setSchema
  • getTableSchema
setOutput must be the first call made on HCatOutputFormat; any other call made before it will throw an exception saying that the output format is not initialized.
The schema for the data being written out is specified by the setSchema method. You must call this method, providing the schema of the data you are writing. If your data has the same schema as the table schema, you can use HCatOutputFormat.getTableSchema() to get the table schema and then pass that along to setSchema().
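As a mirror-image sketch for the write side (the class name WriteSetup and the table name out_table are placeholders), note that setOutput comes before every other HCatOutputFormat call:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class WriteSetup {
   // Configures a job to write the placeholder table "out_table" in the
   // default database (null database name, null partition values).
   public static void configureWrite(Job job) throws Exception {
      // setOutput must be the first HCatOutputFormat call on the job.
      HCatOutputFormat.setOutput(job, OutputJobInfo.create(null, "out_table", null));

      // Here the data is assumed to match the table schema exactly, so the
      // table's own schema is fetched and handed to setSchema.
      HCatSchema schema = HCatOutputFormat.getTableSchema(job);
      HCatOutputFormat.setSchema(job, schema);
      job.setOutputFormatClass(HCatOutputFormat.class);
   }
}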

Example

The following MapReduce program reads data from one table, which it assumes to have an integer in the second column ("column 1", since columns are numbered from zero), and counts how many instances of each distinct value it finds. That is, it does the equivalent of "select col1, count(*) from $table group by col1;".
For example, if the values in the second column are {1, 1, 1, 3, 3, 5}, then the program will produce the following output of values and counts −
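1, 3
3, 2
5, 1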
Let us now take a look at the program code –
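The listing below is a sketch of such a program. The class name GroupByCol1 and the two command-line arguments (input table name, then output table name) are assumptions; the HCatalog calls are the ones described in the sections above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByCol1 extends Configured implements Tool {

   // Emits (col1, 1) for every input record; column 1 is assumed to hold an integer.
   public static class Map extends Mapper<WritableComparable, HCatRecord, IntWritable, IntWritable> {
      @Override
      protected void map(WritableComparable key, HCatRecord value, Context context)
            throws IOException, InterruptedException {
         int col1 = (Integer) value.get(1);
         context.write(new IntWritable(col1), new IntWritable(1));
      }
   }

   // Sums the counts per distinct value and writes (value, count) as an HCatRecord.
   public static class Reduce extends Reducer<IntWritable, IntWritable, WritableComparable, HCatRecord> {
      @Override
      protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable v : values) {
            sum += v.get();
         }
         HCatRecord record = new DefaultHCatRecord(2);
         record.set(0, key.get());
         record.set(1, sum);
         context.write(null, record);
      }
   }

   @Override
   public int run(String[] args) throws Exception {
      Configuration conf = getConf();
      args = new GenericOptionsParser(conf, args).getRemainingArgs();
      String inputTableName = args[0];
      String outputTableName = args[1];
      String dbName = null; // null selects the default database

      Job job = new Job(conf, "GroupByCol1");
      job.setJarByClass(GroupByCol1.class);

      // Read from the HCatalog-managed input table.
      HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
      job.setInputFormatClass(HCatInputFormat.class);

      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(WritableComparable.class);
      job.setOutputValueClass(DefaultHCatRecord.class);

      // Write to the HCatalog-managed output table, reusing its own schema.
      HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
      HCatSchema schema = HCatOutputFormat.getTableSchema(job);
      HCatOutputFormat.setSchema(job, schema);
      job.setOutputFormatClass(HCatOutputFormat.class);

      return job.waitForCompletion(true) ? 0 : 1;
   }

   public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new GroupByCol1(), args));
   }
}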
Before compiling the above program, you have to download some jars and add them to the classpath for this application. You need to download all the Hive and HCatalog jars (hcatalog-core-0.5.0.jar, hive-metastore-0.10.0.jar, libthrift-0.7.0.jar, hive-exec-0.10.0.jar, libfb303-0.7.0.jar, jdo2-api-2.3-ec.jar, slf4j-api-1.6.1.jar).
Use the following commands to copy those jar files from the local file system to HDFS and add them to the classpath.
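The exact source locations of the jars depend on your installation; as a sketch, assuming they live under $HCAT_HOME and $HIVE_HOME and using an arbitrary HDFS staging directory /tmp/hcatjars:

hadoop fs -mkdir -p /tmp/hcatjars
hadoop fs -copyFromLocal $HCAT_HOME/share/hcatalog/hcatalog-core-0.5.0.jar /tmp/hcatjars/
hadoop fs -copyFromLocal $HIVE_HOME/lib/hive-metastore-0.10.0.jar /tmp/hcatjars/
hadoop fs -copyFromLocal $HIVE_HOME/lib/hive-exec-0.10.0.jar /tmp/hcatjars/
hadoop fs -copyFromLocal $HIVE_HOME/lib/libthrift-0.7.0.jar /tmp/hcatjars/
hadoop fs -copyFromLocal $HIVE_HOME/lib/libfb303-0.7.0.jar /tmp/hcatjars/
hadoop fs -copyFromLocal $HIVE_HOME/lib/jdo2-api-2.3-ec.jar /tmp/hcatjars/
hadoop fs -copyFromLocal $HIVE_HOME/lib/slf4j-api-1.6.1.jar /tmp/hcatjars/

# Comma-separated list for -libjars; the client JVM also needs the jars on its classpath.
export LIB_JARS=hdfs:///tmp/hcatjars/hcatalog-core-0.5.0.jar,hdfs:///tmp/hcatjars/hive-metastore-0.10.0.jar,hdfs:///tmp/hcatjars/hive-exec-0.10.0.jar,hdfs:///tmp/hcatjars/libthrift-0.7.0.jar,hdfs:///tmp/hcatjars/libfb303-0.7.0.jar,hdfs:///tmp/hcatjars/jdo2-api-2.3-ec.jar,hdfs:///tmp/hcatjars/slf4j-api-1.6.1.jar
export HADOOP_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core-0.5.0.jar:$HIVE_HOME/lib/*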
Use the following command to compile and execute the given program.
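One possible sequence, assuming the sketch above was saved as GroupByCol1.java and the environment variables from the previous step are set:

# Compile against the Hadoop and Hive/HCatalog jars.
javac -cp "$(hadoop classpath):$HIVE_HOME/lib/*:$HCAT_HOME/share/hcatalog/*" GroupByCol1.java
jar cf groupbycol1.jar GroupByCol1*.class

# Run: -libjars ships the dependencies; then the input and output table names.
hadoop jar groupbycol1.jar GroupByCol1 -libjars $LIB_JARS mytable out_table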
Let's check your output directory (hdfs: user/tmp/hive) for the output files (part_0000 and part_0001).
