Comparison with Traditional Databases

While Hive resembles a traditional database in many ways (such as supporting an SQL interface), its HDFS and MapReduce underpinnings mean that there are a number of architectural differences that directly influence the features that Hive supports, which in turn affects the uses that Hive can be put to.

Schema on Read Versus Schema on Write

In a traditional database, a table’s schema is enforced at data load time. If the data being loaded doesn’t conform to the schema, then it is rejected. This design is sometimes called schema on write, since the data is checked against the schema when it is written into the database.

Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query is issued. This is called schema on read. There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since the data does not have to be read, parsed, and serialized to disk in the database’s internal format: the load operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed. (This is possible in Hive using external tables; see “Managed Tables and External Tables”.)
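
For example, a load in Hive can be little more than an HDFS file move, and two table definitions can be layered over the same underlying files. The following HiveQL sketch illustrates both points; the table names, columns, and paths are invented for illustration.

CREATE TABLE raw_logs (log_line STRING);

-- Loading just moves the file into the table's directory; nothing is parsed or validated yet.
LOAD DATA INPATH '/incoming/logs.txt' INTO TABLE raw_logs;

-- Two external tables over the same directory give two schemas for one dataset.
CREATE EXTERNAL TABLE logs_unparsed (log_line STRING)
  LOCATION '/data/weblogs';

CREATE EXTERNAL TABLE logs_fielded (host STRING, request STRING, status INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/weblogs';

The records are only interpreted field by field when one of these tables is queried; dropping either external table removes the metadata but leaves the underlying files untouched.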

Schema on write makes queries faster, since the database can index columns and perform compression on the data. The trade-off, however, is that it takes longer to load data into the database. Furthermore, there are many scenarios where the schema is not known at load time, so there are no indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.

Updates, Transactions, and Indexes

Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive’s feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.

However, there are workloads where updates (or insert appends, at least) are needed, or where indexes would yield significant performance gains. On the transactions front, Hive doesn’t define clear semantics for concurrent access to tables, which means applications need to build their own application-level concurrency or locking mechanism.
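
As a rough sketch (the table and column names here are invented), an “update” in this model is a bulk rewrite: compute the new contents with a query and overwrite the target table in one pass, rather than modifying individual rows in place.

CREATE TABLE pageview_counts (url STRING, views BIGINT);

-- Replace the table's contents wholesale; there is no row-level UPDATE in this workflow.
INSERT OVERWRITE TABLE pageview_counts
SELECT url, count(1)
FROM parsed_logs
GROUP BY url;

Re-running the query with fresh source data simply rewrites the table from scratch, which fits a workload where full-table scans are the norm anyway.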

The Hive team is actively working on improvements in all these areas. Change is also coming from another direction: HBase integration. HBase (see the chapter on HBase) has different storage characteristics to HDFS, such as the ability to do row updates and column indexing, so we can expect to see these features used by Hive in future releases. HBase integration with Hive is still in the early stages of development; you can find out more at http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

