how to use Hive to run a query - Hadoop

Let’s see how to use Hive to run a query on the weather dataset we explored in earlier chapters. The first step is to load the data into Hive’s managed storage. Here we’ll have Hive use the local filesystem for storage; later we’ll see how to store tables in HDFS.

Just like an RDBMS, Hive organizes its data into tables. We create a table to hold the weather data using the CREATE TABLE statement:

The first line declares a records table with three columns: year, temperature, and quality. The type of each column must be specified, too: here the year is a string, while the other two columns are integers.

So far, the SQL is familiar. The ROW FORMAT clause, however, is particular to HiveQL. What this declaration is saying is that each row in the data file is tab-delimited text.

Hive expects there to be three fields in each row, corresponding to the table columns, with fields separated by tabs, and rows by newlines. Next we can populate Hive with the data. This is just a small sample, for exploratory purposes:

Running this command tells Hive to put the specified local file in its warehouse directory. This is a simple filesystem operation. There is no attempt, for example, to parse the file and store it in an internal database format, since Hive does not mandate any particular file format. Files are stored verbatim: they are not modified by Hive.

In this example, we are storing Hive tables on the local filesystem ( is set to its default value of file:///). Tables are stored as directories under Hive’s warehouse directory, which is controlled by the hive.metastore.warehouse.dir, and defaults to /user/hive/warehouse.

Thus, the files for the records table are found in the /user/hive/warehouse/records directory on the local filesystem: % ls /user/hive/warehouse/record/ sample.txt

In this case, there is only one file, sample.txt, but in general there can be more, and Hive will read all of them when querying the table.

The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete any existing files in the directory for the table. If it is omitted, then the new files are simply added to the table’s directory (unless they have the same names, in which case they replace the old files).

Now that the data is in Hive, we can run a query against it:

This SQL query is unremarkable. It is a SELECT statement with a GROUP BY clause for grouping rows into years, which uses the MAX() aggregate function to find the maximum temperature for each year group. But the remarkable thing is that Hive transforms this query into a MapReduce job, which it executes on our behalf, then prints the results to the console. There are some nuances such as the SQL constructs that Hive supports and the format of the data that we can query and we shall explore some of these in this chapter but it is the ability to execute SQL queries against our raw data that gives Hive its power.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd Protection Status

Hadoop Topics