Querying Data

This section discusses how to use various forms of the SELECT statement to retrieve data from Hive.

Sorting and Aggregating

Sorting data in Hive can be achieved by using the standard ORDER BY clause, but there is a catch. ORDER BY produces a result that is totally sorted, as expected, but to do so it sets the number of reducers to one, making it very inefficient for large datasets. (Hopefully, a future release of Hive will employ the techniques described in “Total Sort” to support efficient parallel sorting.) When a globally sorted result is not required (and in many cases it isn’t), you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.

In some cases, you want to control which reducer a particular row goes to, typically so you can perform some subsequent aggregation. This is what Hive’s DISTRIBUTE BY clause does. Here’s an example that sorts the weather dataset by year and temperature, in such a way as to ensure that all the rows for a given year end up in the same reducer partition:
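A sketch of such a query is shown below; records is a placeholder name for the weather table, and year and temperature are the columns described above.

SELECT year, temperature
FROM records
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;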

This is a reworking in Hive of the discussion in “Secondary Sort”.

A follow-on query (or a query that nests this query as a subquery; see “Subqueries”) would be able to use the fact that each year’s temperatures were grouped and sorted (in descending order) in the same file.

If the columns for SORT BY and DISTRIBUTE BY are the same, you can use CLUSTER BY as a shorthand for specifying both.
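For instance, the following two queries are equivalent (a sketch using the same placeholder records table); CLUSTER BY distributes and sorts, in ascending order, on the listed columns:

SELECT year, temperature
FROM records
DISTRIBUTE BY year, temperature
SORT BY year, temperature;

SELECT year, temperature
FROM records
CLUSTER BY year, temperature;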

MapReduce Scripts

Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and REDUCE clauses make it possible to invoke an external script or program from Hive. Suppose we want to use a script to filter out rows that don’t meet some condition, such as the script in the example below, which removes poor-quality readings.

Example: Python script to filter out poor-quality weather records
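A minimal sketch of such a filter script, assuming tab-separated year, temperature, and quality fields on standard input, and treating a temperature of 9999 and quality codes other than 0, 1, 4, 5, and 9 as poor-quality readings (these thresholds are assumptions about the dataset):

#!/usr/bin/env python
# Reads tab-separated (year, temperature, quality) records from standard input
# and echoes (year, temperature) for readings that pass the quality check.
import re
import sys

for line in sys.stdin:
    (year, temp, quality) = line.strip().split()
    # 9999 is assumed to mark a missing temperature; quality codes 0, 1, 4, 5,
    # and 9 are assumed to indicate acceptable readings.
    if temp != "9999" and re.match("[01459]", quality):
        print("%s\t%s" % (year, temp))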

Before running the query, we need to register the script with Hive. This is so Hive knows to ship the file to the Hadoop cluster (see “Distributed Cache”). The query itself streams the year, temperature, and quality fields as a tab-separated line to the is_good_quality.py script, and parses the tab-separated output into year and temperature fields to form the output of the query.
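A sketch of the registration and query follows; the script path is hypothetical, and records is again a placeholder table name.

ADD FILE /path/to/is_good_quality.py;  -- hypothetical path

FROM records
SELECT TRANSFORM(year, temperature, quality)
USING 'is_good_quality.py'
AS year, temperature;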

This example has no reducers. If we use a nested form for the query, we can specify both a map and a reduce function. This time we use the MAP and REDUCE keywords, but SELECT TRANSFORM in both cases would have the same result. Sketches of the nested query and of the max_temperature_reduce.py script are shown below:
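First, the nested query (a sketch; records is a placeholder table name, and CLUSTER BY ensures each year's rows reach the same reducer in sorted order):

FROM (
  FROM records
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature
  CLUSTER BY year) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;

And a minimal sketch of the reducer script, which assumes its input (year, temperature) pairs arrive grouped by year, as the query above arranges:

#!/usr/bin/env python
# Emits the maximum temperature for each year, assuming the input
# is grouped by year.
import sys

(last_key, max_val) = (None, -sys.maxsize)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print("%s\t%s" % (last_key, max_val))
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print("%s\t%s" % (last_key, max_val))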

Joins

One of the nice things about using Hive, rather than raw MapReduce, is that it makes performing commonly used operations very simple. Join operations are a case in point, given how involved they are to implement in MapReduce (“Joins”).

Inner joins

The simplest kind of join is the inner join, where each match in the input tables results in a row in the output. Consider two small demonstration tables: sales, which lists the names of people and the IDs of the items they bought, and things, which lists each item’s ID and name.

We can perform an inner join on the two tables as follows:
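A sketch of the join; as described below, the predicate matches on the id column of both tables:

SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);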

The table in the FROM clause (sales) is joined with the table in the JOIN clause (things), using the predicate in the ON clause. Hive supports only equijoins, meaning that only equality can be used in the join predicate; here it matches on the id column in both tables.

Some databases, such as MySQL and Oracle, allow you to list the join tables in the FROM clause and specify the join condition in the WHERE clause of a SELECT statement. However, this syntax is not supported in Hive, so the following fails with a parse error:
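For instance, a query written in that style (a sketch of the unsupported form) fails to parse:

SELECT sales.*, things.*
FROM sales, things
WHERE sales.id = things.id;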

Hive only allows a single table in the FROM clause, and joins must follow the SQL-92 JOIN clause syntax. In Hive, you can join on multiple columns in the join predicate by specifying a series of expressions, separated by AND keywords. You can also join more than two tables by supplying additional JOIN...ON... clauses in the query. Hive is intelligent about trying to minimize the number of MapReduce jobs to perform the joins.

A single join is implemented as a single MapReduce job, but multiple joins can be performed in less than one MapReduce job per join if the same column is used in the join condition. You can see how many MapReduce jobs Hive will use for any particular query by prefixing it with the EXPLAIN keyword:
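For example, applied to the inner join from earlier:

EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);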

The EXPLAIN output includes many details about the execution plan for the query, including the abstract syntax tree, the dependency graph for the stages that Hive will execute, and information about each stage.

Stages may be MapReduce jobs or operations such as file moves. For even more detail, prefix the query with EXPLAIN EXTENDED.

Hive currently uses a rule-based query optimizer for determining how to execute a query, but it’s likely that in the future a cost-based optimizer will be added.

Outer joins

Outer joins allow you to find nonmatches in the tables being joined. In the current example, when we performed an inner join, the row for Ali did not appear in the output, since the ID of the item she purchased was not present in the things table. If we change the join type to LEFT OUTER JOIN, the query will return a row for every row in the left table (sales), even if there is no corresponding row in the table it is being joined to (things):
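A sketch of this query:

SELECT sales.*, things.*
FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);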

The order of the tables in the JOIN clauses is significant: it’s generally best to have the largest table last, but see http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins for more details, including how to give hints to the Hive planner.

Notice that the row for Ali is now returned, and the columns from the things table are NULL, since there is no match.

Hive supports right outer joins, which reverse the roles of the tables relative to the left join. In this case, all items from the things table are included, even those that weren’t purchased by anyone (a scarf):
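A sketch of this query:

SELECT sales.*, things.*
FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);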

Finally, there is a full outer join, where the output has a row for each row from both tables in the join:
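A sketch of this query:

SELECT sales.*, things.*
FROM sales FULL OUTER JOIN things ON (sales.id = things.id);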

Semi joins

Hive doesn’t support IN subqueries (at the time of writing), but you can use a LEFT SEMI JOIN to do the same thing.

Consider this IN subquery, which finds all the items in the things table that are in the sales table:
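A sketch of the IN subquery:

SELECT *
FROM things
WHERE things.id IN (SELECT id FROM sales);

Rewritten as a LEFT SEMI JOIN, it becomes:

SELECT *
FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);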

There is a restriction that we must observe for LEFT SEMI JOIN queries: the right table (sales) may only appear in the ON clause. It cannot be referenced in a SELECT expression, for example.

Map joins

If one table is small enough to fit in memory, Hive can load it into memory in each of the mappers and perform the join there. The syntax for specifying a map join is a hint embedded in an SQL C-style comment:
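A sketch of the hint, loading the smaller things table into memory:

SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);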

The job to execute this query has no reducers, so it would not work for a RIGHT or FULL OUTER JOIN, since the absence of a match can be detected only in an aggregating (reduce) step over all the inputs.

Map joins can take advantage of bucketed tables (“Buckets”), since a mapper working on a bucket of the left table only needs to load the corresponding buckets of the right table to perform the join. The syntax for the join is the same as for the in-memory case above; however, you also need to enable the optimization with:

SET hive.optimize.bucketmapjoin=true;

Subqueries

A subquery is a SELECT statement that is embedded in another SQL statement. Hive has limited support for subqueries, only permitting a subquery in the FROM clause of a SELECT statement.

Other databases allow subqueries almost anywhere that an expression is valid, such as in the list of values to retrieve from a SELECT statement or in the WHERE clause. Many uses of subqueries can be rewritten as joins, so if you find yourself writing a subquery where Hive does not support it, then see if it can be expressed as a join. For example, an IN subquery can be written as a semi join, or an inner join (see “Joins” ).

The following query finds the mean maximum temperature for every year and weather station:
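A sketch of such a query, assuming a hypothetical records table with station, year, temperature, and quality columns, and treating a temperature of 9999 and quality codes other than 0, 1, 4, 5, and 9 as invalid readings (assumed thresholds):

SELECT station, year, AVG(max_temperature)
FROM (
  SELECT station, year, MAX(temperature) AS max_temperature
  FROM records
  WHERE temperature != 9999
    AND (quality = 0 OR quality = 1 OR quality = 4
         OR quality = 5 OR quality = 9)
  GROUP BY station, year
) mt
GROUP BY station, year;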

The subquery is used to find the maximum temperature for each station/year combination; the outer query then uses the AVG aggregate function to average those maximum temperature readings for each station/year combination.

The outer query accesses the results of the subquery like it does a table, which is why the subquery must be given an alias (mt). The columns of the subquery have to be given unique names so that the outer query can refer to them.

Views

A view is a sort of “virtual table” that is defined by a SELECT statement. Views can be used to present data to users in a different way from how it is actually stored on disk.

Often, the data from existing tables is simplified or aggregated in a particular way that makes it convenient for further processing. Views may also be used to restrict users’ access to particular subsets of tables that they are authorized to see.

In Hive, a view is not materialized to disk when it is created; rather, the view’s SELECT statement is executed when the statement that refers to the view is run. If a view performs extensive transformations on the base tables, or is used frequently, then you may choose to manually materialize it by creating a new table that stores the contents of the view (see “CREATE TABLE...AS SELECT”).

We can use views to rework the query from the previous section for finding the mean maximum temperature for every year and weather station. First, let’s create a view for valid records, that is, records that have a particular quality value:
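A sketch of the view definition, assuming the same hypothetical records table and the same assumed quality thresholds as before:

CREATE VIEW valid_records
AS
SELECT *
FROM records
WHERE temperature != 9999
  AND (quality = 0 OR quality = 1 OR quality = 4
       OR quality = 5 OR quality = 9);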

When we create a view, the query is not run; it is simply stored in the metastore. Views are included in the output of the SHOW TABLES command, and you can see more details about a particular view, including the query used to define it, by issuing the DESCRIBE EXTENDED view_name command.

Next, let’s create a second view of maximum temperatures for each station and year. It is based on the valid_records view:
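A sketch of this second view; the name max_temperatures is assumed for illustration:

CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;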

In this view definition, we list the column names explicitly. We do this because the maximum temperature column is an aggregate expression; otherwise, Hive would create a column alias for us (such as _c2). We could equally well have used an AS clause in the SELECT to name the column.

With the views in place, we can now use them by running a query:
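A sketch of the final query over the views:

SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;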

The result of the query is the same as running the one that uses a subquery, and, in particular, the number of MapReduce jobs that Hive creates is the same for both: two in each case, one for each GROUP BY. This example shows that Hive can combine a query on a view into a sequence of jobs that is equivalent to writing the query without using a view. In other words, Hive won’t needlessly materialize a view even at execution time.

Views in Hive are read-only, so there is no way to load or insert data into an underlying base table via a view.

