Running Hive

In this section, we look at some more practical aspects of running Hive, including how to set up Hive to run against a Hadoop cluster and a shared metastore. In doing so, we’ll see Hive’s architecture in some detail.

Configuring Hive

Hive is configured using an XML configuration file, like Hadoop’s. The file is called hive-site.xml and is located in Hive’s conf directory. This is where you can set properties that you want to take effect every time you run Hive. The same directory contains hive-default.xml, which documents the properties that Hive exposes and their default values.

You can override the directory that Hive looks in for hive-site.xml by passing the --config option to the hive command:
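  % hive --config /home/me/hive-conf

(The path above is purely illustrative; point --config at whichever directory holds the hive-site.xml you want to use.)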

Note that this option specifies the containing directory, not hive-site.xml itself. It can be useful if you have multiple site files (for different clusters, say) that you switch between on a regular basis. Alternatively, you can set the HIVE_CONF_DIR environment variable to the configuration directory for the same effect.

The hive-site.xml file is a natural place to put the cluster connection details: you can specify the filesystem and jobtracker using the usual Hadoop properties, fs.default.name and mapred.job.tracker. If not set, they default to the local filesystem and the local (in-process) job runner, just as they do in Hadoop, which is very handy when trying out Hive on small trial datasets.
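As a minimal sketch, the relevant hive-site.xml entries might look like this (the hostnames and ports are assumptions for illustration; substitute your own namenode and jobtracker addresses):

  <?xml version="1.0"?>
  <configuration>
    <!-- HDFS namenode to use as Hive's default filesystem -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8020</value>
    </property>
    <!-- Jobtracker that Hive's MapReduce jobs are submitted to -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:8021</value>
    </property>
  </configuration>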

Metastore configuration settings (covered in “The Metastore”) are commonly found in hive-site.xml, too.

Hive also permits you to set properties on a per-session basis, by passing the -hiveconf option to the hive command. For example, the following command sets the cluster (to a pseudo-distributed cluster) for the duration of the session:
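  % hive -hiveconf fs.default.name=hdfs://localhost:8020 \
         -hiveconf mapred.job.tracker=localhost:8021

(The localhost addresses and ports here are assumptions that match a typical pseudo-distributed setup; adjust them to your cluster.)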

If you plan to have more than one Hive user sharing a Hadoop cluster, then you need to make the directories that Hive uses writable by all users. The following commands will create the directories and set their permissions appropriately:
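  % hadoop fs -mkdir /tmp
  % hadoop fs -chmod a+w /tmp
  % hadoop fs -mkdir /user/hive/warehouse
  % hadoop fs -chmod a+w /user/hive/warehouse

(/user/hive/warehouse is Hive’s default warehouse directory; if you have changed it, create and open up that directory instead.)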

If all users are in the same group, then permissions g+w are sufficient on the warehouse directory.

You can change settings from within a session, too, using the SET command. This is useful for changing Hive or MapReduce job settings for a particular query. For example, the following command ensures buckets are populated according to the table definition (see “Buckets”):
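  hive> SET hive.enforce.bucketing=true;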

To see the current value of any property, use SET with just the property name:
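  hive> SET hive.enforce.bucketing;
  hive.enforce.bucketing=true

(The output shown assumes the property was set as in the previous example.)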

By itself, SET will list all the properties (and their values) set by Hive. Note that the list will not include Hadoop defaults, unless they have been explicitly overridden in one of the ways covered in this section. Use SET -v to list all the properties in the system, including Hadoop defaults.

There is a precedence hierarchy to setting properties. In the following list, lower numbers take precedence over higher numbers:

  1. The Hive SET command
  2. The command line -hiveconf option
  3. hive-site.xml
  4. hive-default.xml
  5. hadoop-site.xml
  6. hadoop-default.xml

Logging

You can find Hive’s error log on the local file system at /tmp/$USER/hive.log. It can be very useful when trying to diagnose configuration problems or other types of error.

Hadoop’s MapReduce task logs are also a useful source for troubleshooting; see “Hadoop User Logs” for where to find them. The logging configuration is in conf/hive-log4j.properties, and you can edit this file to change log levels and other logging-related settings. Often, though, it’s more convenient to set logging configuration for the session. For example, the following handy invocation will send debug messages to the console:
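  % hive -hiveconf hive.root.logger=DEBUG,console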

Hive Services

The Hive shell is only one of several services that you can run using the hive command. You can specify the service to run using the --service option. Type hive --service help to get a list of available service names; the most useful are described below.

cli

The command line interface to Hive (the shell). This is the default service.

hiveserver

Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (it defaults to 10000); a startup example follows this list.

hwi

The Hive Web Interface. See “The Hive Web Interface (HWI)”.

jar

The Hive equivalent of hadoop jar: a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.

metastore

By default, the metastore is run in the same process as the Hive service. Using this service, it is possible to run the metastore as a standalone (remote) process. Set the METASTORE_PORT environment variable to specify the port the server will listen on.
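For example, here is a minimal sketch of starting the Thrift Hive server and a standalone metastore server on explicit ports (the metastore port, 9083, is an illustrative assumption; the Hive server port shown is the default):

  % export HIVE_PORT=10000
  % hive --service hiveserver

  % export METASTORE_PORT=9083
  % hive --service metastore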

The Hive Web Interface (HWI)

As an alternative to the shell, you might want to try Hive’s simple web interface. Start it using the following commands:
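  % export ANT_LIB=/path/to/ant/lib
  % hive --service hwi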

(You only need to set the ANT_LIB environment variable if Ant’s library is not found in /opt/ant/lib on your system.) Then navigate to http://localhost:9999/hwi in your browser. From there, you can browse Hive database schemas and create sessions for issuing commands and queries.

It’s possible to run the web interface as a shared service to give users within an organization access to Hive without having to install any client software.

Hive Clients

If you run Hive as a server (hive --service hiveserver), then there are a number of different mechanisms for connecting to it from applications. The relationship between Hive clients and Hive services is illustrated in the figure below.

[Figure: Hive architecture]

Thrift Client

The Hive Thrift Client makes it easy to run Hive commands from a wide range of programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They can be found in the src/service/src subdirectory in the Hive distribution.

JDBC Driver

Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port. (The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings.) At the time of writing, default is the only database name supported.

You may alternatively choose to connect to Hive via JDBC in embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch a standalone server; embedded mode does not use the Thrift service or the Hive Thrift Client.
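As a minimal sketch, a Java client might connect to a standalone Hive server like this (the host, port, and query are illustrative assumptions):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
      // Register the Hive Type 4 JDBC driver
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      // Standalone server form; use "jdbc:hive://" instead for embedded mode
      Connection con =
          DriverManager.getConnection("jdbc:hive://localhost:10000/default");
      Statement stmt = con.createStatement();
      // List the tables in the default database
      ResultSet rs = stmt.executeQuery("SHOW TABLES");
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
      rs.close();
      stmt.close();
      con.close();
    }
  }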

The JDBC driver is still in development, and in particular it does not support the full JDBC API.

ODBC Driver

The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.) The ODBC driver is still in development, so you should refer to the latest instructions on the Hive wiki for how to build and run it.

There are more details on using these clients on the Hive wiki at http://wiki.apache.org/hadoop/Hive/HiveClient.

The Metastore

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration. Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can have only one Hive session open at a time that uses the same metastore. Trying to start a second session produces the error Failed to start database 'metastore_db' when it attempts to open a connection to the metastore.

[Figure: Metastore configurations]

The solution to supporting multiple sessions (and therefore multiple users) is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine. Any JDBC-compliant database may be used by setting the javax.jdo.option.* configuration properties listed in the table below.

Table: Important metastore configuration properties

  hive.metastore.local                    true to run the metastore in-process; false to connect to remote metastore server(s)
  hive.metastore.uris                     Comma-separated thrift://host:port URIs of remote metastore servers (used when hive.metastore.local is false)
  javax.jdo.option.ConnectionURL          JDBC URL of the database backing the metastore
  javax.jdo.option.ConnectionDriverName   JDBC driver class name for the metastore database
  javax.jdo.option.ConnectionUserName     JDBC user name
  javax.jdo.option.ConnectionPassword     JDBC password

MySQL is a popular choice for the standalone metastore. In this case, javax.jdo.option.ConnectionURL is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. (The user name and password should be set, too, of course.)
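A minimal sketch of the corresponding hive-site.xml entries might look like this (the host name, database name, and credentials are placeholder assumptions):

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/hive_meta?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>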

The JDBC driver JAR file for MySQL (Connector/J) must be on Hive’s classpath, which is simply achieved by placing it in Hive’s lib directory.

The properties have the javax.jdo prefix since the metastore implementation uses the Java Data Objects (JDO) API for persisting Java objects. It uses the DataNucleus implementation of JDO.

Going a step further, there’s another metastore configuration called a remote metastore, where one or more metastore servers run in separate processes from the Hive service. This brings better manageability and security, since the database tier can be completely firewalled off and the clients no longer need the database credentials.

A Hive service is configured to use a remote metastore by setting hive.metastore.local to false and hive.metastore.uris to the metastore server URIs, separated by commas if there is more than one. Metastore server URIs are of the form thrift://host:port, where the port corresponds to the one set by METASTORE_PORT when starting the metastore server (see “Hive Services”).
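For example (the host and port are illustrative and should match your metastore server):

  <property>
    <name>hive.metastore.local</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastorehost:9083</value>
  </property>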


