Apache Tajo Table Management - Apache Tajo

How are tables managed in Apache Tajo?

In Apache Tajo, a table is the logical view of a data source. A table definition includes properties such as the logical schema, partitions, and URI. A Tajo table can be a directory in HDFS, a single file, an HBase table, or an RDBMS table.

The types of tables supported by Apache Tajo are:

  • external table
  • internal table

What is External Table in Apache Tajo?

An external table requires the location property when the table is created. For instance, if the data already exists as text/JSON files or as an HBase table, it can be registered as a Tajo external table.

The following query is an example of external table creation.
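A minimal sketch of such a statement (the column list, delimiter, and port are illustrative) −

```sql
CREATE EXTERNAL TABLE sample (
  num1 INT,
  num2 TEXT
) USING TEXT WITH ('text.delimiter' = ',')
LOCATION 'hdfs://localhost:9000/path/to/table';
```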

Here,

  • External keyword − This keyword creates an external table at the specified location.
  • sample − This refers to the table name.
  • Location − A directory on HDFS, Amazon S3, HBase, or the local file system. To assign a location property for directories, use a URI like the following −
    • HDFS − hdfs://localhost:port/path/to/table
    • Amazon S3 − s3://bucket-name/table
    • local file system − file:///path/to/table
    • Openstack Swift − swift://bucket-name/table

External Table Properties

An external table has the following properties −

  • TimeZone − Users can specify a time zone for reading or writing a table.
  • Compression format − Used to reduce data size. For instance, text/JSON files use the compression.codec property.

What is Internal Table in Apache Tajo?

An internal table is also called a managed table. It is created in a pre-defined physical location called the tablespace.

Syntax
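For example, a minimal internal-table creation (the table and column names are illustrative) looks like this −

```sql
CREATE TABLE employee (id INT, name TEXT);
```

Because no location is given, the table is placed in the configured warehouse directory.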

By default, Tajo uses the “tajo.warehouse.directory” property defined in “conf/tajo-site.xml”. A tablespace configuration can be used to assign a new location for the table.

Tablespace

A tablespace defines a location in the storage system. Tablespaces are supported only for internal tables and are accessed by their names. Each tablespace can use a different storage type. If no tablespace is specified, Tajo uses the default tablespace in its root directory.

Tablespace Configuration

Copy the “conf/storage-site.json.template” file and rename it to “storage-site.json”. This file acts as the configuration for tablespaces. Tajo data formats use the following configuration −

HDFS Configuration
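A sketch of an HDFS tablespace entry in “conf/storage-site.json” (the tablespace name, host, port, and path are illustrative) −

```json
{
  "spaces": {
    "hdfs_space": {
      "uri": "hdfs://localhost:9000/path/to/space"
    }
  }
}
```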

HBase Configuration
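A sketch of an HBase tablespace entry, assuming the hbase:zk URI scheme described in the Tajo storage documentation (the ZooKeeper quorum host and port are illustrative) −

```json
{
  "spaces": {
    "hbase_space": {
      "uri": "hbase:zk://localhost:2181/hbase"
    }
  }
}
```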

Text File Configuration
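A sketch of a local-filesystem (text file) tablespace entry (the tablespace name and path are illustrative) −

```json
{
  "spaces": {
    "text_space": {
      "uri": "file:///path/to/data"
    }
  }
}
```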

Tablespace Creation

An internal table’s records can be populated only from another table (for example, via an AS SELECT clause). The table can be assigned to a tablespace at creation time.

Syntax
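The general form of the statement (a sketch of Tajo’s CREATE TABLE syntax; optional clauses are in brackets) −

```sql
CREATE TABLE [IF NOT EXISTS] <table_name> [(<column_list>)]
  [TABLESPACE <tablespace_name>]
  [USING <storage_type> [WITH (<key> = <value>, ...)]]
  [AS <select_statement>]
```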

Here,

  • IF NOT EXISTS − This avoids an error if a table with the same name already exists.
  • TABLESPACE − This clause assigns the tablespace name.
  • Storage type − Tajo supports data formats such as text, JSON, HBase, Parquet, SequenceFile, RCFile, and ORC.
  • AS select statement − Populates the table with records selected from another table.

Configure Tablespace

Start your Hadoop services and open the file “conf/storage-site.json”, then add the following changes −
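For example, a “space1” entry pointing at HDFS might look like this (the host, port, and path are illustrative) −

```json
{
  "spaces": {
    "space1": {
      "uri": "hdfs://localhost:9000/path/to/space1"
    }
  }
}
```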

Here, Tajo will refer to the data at the HDFS location, and space1 is the tablespace name. If Hadoop is not started, the tablespace cannot be registered.

Query
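A minimal sketch of such a query (the column list is illustrative) −

```sql
CREATE TABLE table1 (num1 INT, num2 TEXT) TABLESPACE space1;
```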

The above query creates a table named “table1” and “space1” refers to the tablespace name.

What are the different Data formats supported by Apache Tajo?

The following are some of the data formats supported by Apache Tajo:

Text

A character-separated values plain-text file represents a tabular data set consisting of rows and columns. Each row is a line of plain text.

Creating Table
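A sketch of registering the file as an external table (the column list, delimiter, and path are illustrative) −

```sql
CREATE EXTERNAL TABLE customer (
  id INT,
  name TEXT,
  address TEXT,
  age INT
) USING TEXT WITH ('text.delimiter' = ',')
LOCATION 'file:///path/to/customers.csv';
```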

Here, “customers.csv” file refers to a comma separated value file located in the Tajo installation directory.

To create an internal table using the text format, use the following query −
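A minimal sketch (the table and column names are illustrative) −

```sql
CREATE TABLE customer_internal (
  id INT,
  name TEXT
) USING TEXT;
```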

In the above query, no tablespace is assigned, so the table goes into Tajo’s default tablespace.

Properties

A text file format has the following properties −

  • text.delimiter − The delimiter character. The default is ‘|’.
  • compression.codec − The compression format. Compression is disabled by default and can be enabled by specifying a compression algorithm.
  • timezone − The time zone used for reading or writing the table.
  • text.error-tolerance.max-num − The maximum number of parsing errors tolerated.
  • text.skip.headerlines − The number of header lines to skip.
  • text.serde − The serialization/deserialization property.

JSON

Apache Tajo supports JSON format for querying data. Tajo treats a JSON object as SQL record. One object equals one row in a Tajo table. Let’s consider “array.json” as follows −
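A sketch of such a file, with one JSON object per line (the field names and values are illustrative) −

```json
{ "id": 1, "name": "tajo" }
{ "id": 2, "name": "hive" }
```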

After the file is created, switch to the Tajo shell and type the following query to create a table using the JSON format.

Query
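A sketch of the statement (the column list must match the JSON field names; the table name and path are illustrative) −

```sql
CREATE EXTERNAL TABLE json_table (
  id INT,
  name TEXT
) USING JSON LOCATION 'file:///path/to/array.json';
```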

It is to be noted that the file data must match the table schema. Alternatively, omit the column names and use *, which does not require a column list.

To create an internal table, use the following query −
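A minimal sketch (the table and column names are illustrative) −

```sql
CREATE TABLE json_internal (id INT, name TEXT) USING JSON;
```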

Parquet

Parquet is a columnar storage format. Tajo uses Parquet format for easy, fast and efficient access.

Table creation

The following query is an example for table creation −
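A minimal sketch (the table and column names are illustrative) −

```sql
CREATE TABLE parquet_table (num1 INT, num2 TEXT, num3 FLOAT8) USING PARQUET;
```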

Parquet file format has the following properties −

  • parquet.block.size − The size of a row group buffered in memory.
  • parquet.page.size − The page size, the smallest unit compressed individually.
  • parquet.compression − The compression algorithm used to compress pages.
  • parquet.enable.dictionary − A boolean value to enable/disable dictionary encoding.

RCFile

RCFile is the Record Columnar File. It consists of binary key/value pairs.

Table creation

The following query is an example for table creation −
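A minimal sketch (the table and column names are illustrative) −

```sql
CREATE TABLE rcfile_table (num1 INT, num2 TEXT) USING RCFILE;
```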

RCFile has the following properties −

  • rcfile.serde − custom deserializer class.
  • compression.codec − compression algorithm.
  • rcfile.null − NULL character.

SequenceFile

SequenceFile is a basic file format in Hadoop which consists of key/value pairs.

Table creation

The following query is an example for table creation −
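A minimal sketch (the table and column names are illustrative) −

```sql
CREATE TABLE seq_table (num1 INT, num2 TEXT) USING SEQUENCEFILE;
```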

This SequenceFile format is Hive-compatible. The equivalent table can be written in Hive as follows −
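A sketch of the equivalent Hive DDL (the table name, column list, and delimiter are illustrative) −

```sql
CREATE TABLE seq_table (num1 INT, num2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE;
```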

ORC

ORC (Optimized Row Columnar) is a columnar storage format from Hive.

Table creation

The following query is an example for table creation −
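A minimal sketch (the table and column names are illustrative) −

```sql
CREATE TABLE orc_table (num1 INT, num2 TEXT) USING ORC;
```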

The ORC format has the following properties −

  • orc.max.merge.distance − When an ORC file is read, reads closer together than this distance are merged.
  • orc.stripe.size − The size of each stripe.
  • orc.buffer.size − The buffer size; the default is 256KB.
  • orc.rowindex.stride − The ORC index stride, in number of rows.
