HCatalog Reader Writer

What is HCatalog Reader Writer?

HCatalog provides an inbuilt data transfer API for parallel input and output without using MapReduce. This API uses a basic storage abstraction of tables and rows to read data from the Hadoop cluster and to write data into it. The data transfer API contains mainly three classes −
  • HCatReader − reads data from a Hadoop cluster.
  • HCatWriter − writes data into a Hadoop cluster.
  • DataTransferFactory − generates reader and writer instances.
This API is suitable for a master-slave node setup. Let us discuss HCatReader and HCatWriter in more detail.

What is HCatReader?

HCatReader is an abstract class internal to HCatalog. It abstracts away the complexities of the underlying system from which the records are retrieved. It defines the following methods −

1. public abstract ReaderContext prepareRead() throws HCatException
   This should be called at the master node to obtain a ReaderContext, which should then be serialized and sent to the slave nodes.

2. public abstract Iterator<HCatRecord> read() throws HCatException
   This should be called at slave nodes to read HCatRecords.

3. public Configuration getConf()
   It returns the configuration class object.

The HCatReader class is used to read data from HDFS. Reading is a two-step process: the first step occurs on the master node of an external system, while the second step is carried out in parallel on multiple slave nodes.
Reads are done on a ReadEntity. Before reading, you must define a ReadEntity from which to read, with the help of ReadEntity.Builder. You can specify a database name, table name, partition, and filter string. For example −
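A sketch of such a ReadEntity definition follows. It assumes the newer package layout org.apache.hive.hcatalog.data.transfer (older HCatalog releases use org.apache.hcatalog.data.transfer):

```java
import org.apache.hive.hcatalog.data.transfer.ReadEntity;

// Build a ReadEntity for table "mytbl" in database "mydb".
// A partition map or filter string may also be supplied via the builder.
ReadEntity.Builder builder = new ReadEntity.Builder();
ReadEntity entity = builder.withDatabase("mydb")
                           .withTable("mytbl")
                           .build();
```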
The above code snippet defines a ReadEntity object ("entity") comprising a table named mytbl in a database named mydb, which can be used to read all the rows of this table. Note that this table must exist in HCatalog prior to the start of this operation.
After defining a ReadEntity, you obtain an instance of HCatReader using the ReadEntity and the cluster configuration −
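A sketch of this step, as a fragment continuing the example above. The metastore URI value is a placeholder; the actual configuration keys depend on your cluster setup:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;

// Cluster configuration, e.g. the Hive metastore URI (placeholder value).
Map<String, String> config = new HashMap<String, String>();
config.put("hive.metastore.uris", "thrift://metastore-host:9083");

// Obtain a reader for the previously defined ReadEntity.
HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
```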
The next step is to obtain a ReaderContext from reader as follows –
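A sketch of the master-side prepareRead() call and the slave-side read, assuming the ReaderContext has been shipped to the slave and that slaveNum identifies the slave's input split:

```java
import java.util.Iterator;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

// On the master: prepare the read and obtain a serializable ReaderContext.
ReaderContext cntxt = reader.prepareRead();

// The ReaderContext is serialized and sent to each slave. On a slave node:
HCatReader slaveReader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
Iterator<HCatRecord> itr = slaveReader.read();
while (itr.hasNext()) {
    HCatRecord record = itr.next();
    // process the record ...
}
```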

What is HCatWriter?

HCatWriter is internal to HCatalog and facilitates writing to HCatalog from external systems. Do not try to instantiate it directly; instead, use DataTransferFactory. It defines the following methods −

1. public abstract WriterContext prepareWrite() throws HCatException
   The external system should invoke this method exactly once from a master node. It returns a WriterContext, which should be serialized and sent to the slave nodes to construct an HCatWriter there.

2. public abstract void write(Iterator<HCatRecord> recordItr) throws HCatException
   This method should be used at slave nodes to perform writes. recordItr is an iterator object that contains the collection of records to be written into HCatalog.

3. public abstract void abort(WriterContext cntxt) throws HCatException
   This method should be called at the master node. Its primary purpose is to do cleanups in case of failures.

4. public abstract void commit(WriterContext cntxt) throws HCatException
   This method should be called at the master node. Its purpose is to do the metadata commit.

Writing is a two-step process in which the first step occurs on the master node and the second step occurs in parallel on the slave nodes.
Writes are done on a WriteEntity which can be constructed in a fashion similar to reads –
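A sketch of the WriteEntity construction, mirroring the ReadEntity example (package names as assumed earlier):

```java
import org.apache.hive.hcatalog.data.transfer.WriteEntity;

// Build a WriteEntity targeting table "mytbl" in database "mydb".
WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withDatabase("mydb")
                            .withTable("mytbl")
                            .build();
```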
The above code creates a WriteEntity object entity which can be used to write into a table named mytbl in the database mydb.
Once you create a WriteEntity, the next step is to obtain a WriterContext −
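A sketch of the master-side step: the writer is obtained from DataTransferFactory and prepareWrite() yields the context to be shipped to the slaves (the config map is the same kind of cluster configuration used on the read side):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.WriterContext;

// Cluster configuration (keys depend on your setup).
Map<String, String> config = new HashMap<String, String>();

// On the master: obtain a writer and prepare the write exactly once.
HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
WriterContext info = writer.prepareWrite();
```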
All of the above steps happen on the master node. The master node then serializes the WriterContext object and makes it available to all the slaves.
On slave nodes, you need to obtain an HCatWriter using WriterContext as follows –
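A one-line sketch, assuming the WriterContext deserialized on the slave is named cntxt:

```java
// On a slave node: reconstruct the writer from the shipped context.
HCatWriter writer = DataTransferFactory.getHCatWriter(cntxt);
```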
Then, the writer takes an iterator as the argument for the write method –
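A sketch of the slave-side write and the master-side finalization; recordItr stands for whatever Iterator<HCatRecord> the slave has assembled:

```java
// On each slave: write the records supplied by the iterator.
writer.write(recordItr);

// Back on the master, after all slaves finish:
//   writer.commit(info);   // on success, commits the metadata
//   writer.abort(info);    // on failure, cleans up partial output
```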
The writer iterates over this iterator in a loop and writes out all the records it yields.
The TestReaderWriter.java file is used to test the HCatReader and HCatWriter classes. The following program demonstrates how to use the HCatReader and HCatWriter APIs to read data from a source table and subsequently write it into a destination table.
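A condensed sketch of such a driver, modelled on the master/slave shape described above. The helper names runsInMaster and runsInSlave are illustrative, not part of the API, and the program requires a live Hadoop cluster and the HCatalog jars to run:

```java
import java.util.Iterator;
import java.util.Map;
import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.*;

public class ReaderWriterSketch {

    // Master side of the read: build the ReadEntity and return a
    // serializable ReaderContext for the slaves.
    static ReaderContext readInMaster(Map<String, String> config)
            throws HCatException {
        ReadEntity entity = new ReadEntity.Builder()
                .withDatabase("mydb").withTable("mytbl").build();
        HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
        return reader.prepareRead();
    }

    // Master side of the write: prepare the write exactly once and
    // return the WriterContext for the slaves.
    static WriterContext writeInMaster(Map<String, String> config)
            throws HCatException {
        WriteEntity entity = new WriteEntity.Builder()
                .withDatabase("mydb").withTable("mytbl").build();
        HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
        return writer.prepareWrite();
    }

    // Slave side: read this slave's split and feed the records
    // straight into a writer built from the shipped context.
    static void runsInSlave(ReaderContext readCntxt,
                            WriterContext writeCntxt,
                            int slaveNum) throws HCatException {
        HCatReader reader =
                DataTransferFactory.getHCatReader(readCntxt, slaveNum);
        Iterator<HCatRecord> itr = reader.read();
        HCatWriter writer = DataTransferFactory.getHCatWriter(writeCntxt);
        writer.write(itr);
    }
}
```

After all slaves finish, the master calls commit(writeCntxt) on its writer on success, or abort(writeCntxt) on failure.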
The above program reads the data from HDFS in the form of records and writes the record data into mytable.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd
