Installing Hive

In normal use, Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS. Metadata such as table schemas is stored in a database called the metastore.

When starting out with Hive, it is convenient to run the metastore on your local machine. In this configuration, which is the default, the Hive table definitions that you create will be local to your machine, so you can’t share them with other users. We’ll see how to configure a shared remote metastore, which is the norm in production environments, later in “The Metastore”.

Installation of Hive is straightforward. Java 6 is a prerequisite, and on Windows you will need Cygwin, too. You also need the same version of Hadoop installed locally that your cluster is running. Of course, you may choose to run Hadoop locally, either in standalone or pseudo-distributed mode, while getting started with Hive.

Which Versions of Hadoop Does Hive Work With?

Any given release of Hive is designed to work with multiple versions of Hadoop. Generally, Hive works with the latest release of Hadoop, as well as supporting a number of older versions. For example, Hive 0.5.0 is compatible with versions of Hadoop between 0.17.x and 0.20.x (inclusive). You don’t need to do anything special to tell Hive which version of Hadoop you are using, beyond making sure that the hadoop executable is on the path or setting the HADOOP_HOME environment variable.
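For example, assuming your local Hadoop installation lives under /usr/local/hadoop (a placeholder path), you might set:

    % export HADOOP_HOME=/usr/local/hadoop    # placeholder: wherever Hadoop is installed locally
    % export PATH=$PATH:$HADOOP_HOME/bin      # or simply ensure hadoop is already on your path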

Download a release at http://hadoop.apache.org/hive/releases.html, and unpack the tarball in a suitable place on your workstation:
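For example, assuming a downloaded tarball named hive-x.y.z.tar.gz (substitute the actual version you downloaded), a minimal setup might look like this:

    % tar xzf hive-x.y.z.tar.gz                # unpack the release
    % export HIVE_HOME=/path/to/hive-x.y.z     # placeholder: where you unpacked it
    % export PATH=$PATH:$HIVE_HOME/bin         # puts the hive command on your path

With the hive command on your path, typing hive launches the Hive shell.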

It is assumed that you have network connectivity from your workstation to the Hadoop cluster. You can test this before running Hive by installing Hadoop locally and performing some HDFS operations with the hadoop fs command.
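For example, listing the root of HDFS is a quick sanity check (this uses the standard hadoop fs command; the cluster address comes from your local Hadoop configuration):

    % hadoop fs -ls /    # succeeds only if the cluster's filesystem is reachable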

The Hive Shell

The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is Hive’s query language, a dialect of SQL. It is heavily influenced by MySQL, so if you are familiar with MySQL you should feel at home using Hive.

When starting Hive for the first time, we can check that it is working by listing its tables: there should be none. The command must be terminated with a semicolon to tell Hive to execute it:
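A first session might look something like the following (the output shown is indicative; the exact timing will differ):

    % hive
    hive> SHOW TABLES;
    OK
    Time taken: 10.425 seconds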

Like SQL, HiveQL is generally case insensitive (except for string comparisons), so show tables; works equally well here.

For a fresh install, the command takes a few seconds to run, since it is lazily creating the metastore database on your machine. (The database stores its files in a directory called metastore_db, which is relative to the directory where you ran the hive command.) You can also run the Hive shell in non-interactive mode. The -f option runs the commands in the specified file, script.q in this example:
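For example:

    % hive -f script.q    # runs the HiveQL statements in script.q, then exits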

For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:
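For example, assuming a small table named dummy exists (the next paragraph shows one way to create it):

    % hive -e 'SELECT * FROM dummy'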

It’s useful to have a small table of data to test queries against, such as trying out functions in SELECT expressions using literal data (see “Operators and Functions” ). Here’s one way of populating a single row table:
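A minimal sketch, assuming a local file at /tmp/dummy.txt and a table named dummy (both names are illustrative):

    % echo 'X' > /tmp/dummy.txt                      # a one-line file to load
    % hive -e "CREATE TABLE dummy (value STRING); \
      LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
      OVERWRITE INTO TABLE dummy"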

In both interactive and non-interactive modes, Hive prints information to standard error during the course of operation, such as the time taken to run a query. You can suppress these messages using the -S option at launch time, which has the effect of showing only the output of the query:
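For example, using the dummy table created above:

    % hive -S -e 'SELECT * FROM dummy'    # -S suppresses informational messages
    X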

