HiveQL - Hadoop

Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 specification. There are a number of reasons for this. Being a fairly young project, it has not had time to provide the full repertoire of SQL-92 language constructs. More fundamentally, SQL-92 compliance has never been an explicit project goal; rather, as an open source project, features were added by developers to meet their users’ needs. Furthermore, Hive has some extensions that are not in SQL-92, inspired by syntax from other database systems, notably MySQL. In fact, to a first-order approximation, HiveQL most closely resembles MySQL’s SQL dialect.

Some of Hive’s extensions to SQL-92 were inspired by MapReduce, such as multitable inserts (see “Multitable insert”) and the TRANSFORM, MAP, and REDUCE clauses (see “MapReduce Scripts”). It turns out that some SQL-92 constructs that are missing from HiveQL are easy to work around using other language features, so there has not been much pressure to implement them. For example, SELECT statements do not (at the time of writing) support a HAVING clause in HiveQL, but the same result can be achieved by adding a subquery in the FROM clause (see “Subqueries”).

This chapter does not provide a complete reference to HiveQL; for that, see the Hive documentation at http://wiki.apache.org/hadoop/Hive/LanguageManual. Instead, we focus on commonly used features and pay particular attention to features that diverge from SQL-92 or from popular databases like MySQL. The table below provides a high-level comparison of SQL and HiveQL.

Table. A high-level comparison of SQL and HiveQL

Data Types

Hive supports both primitive and complex data types. Primitives include numeric, boolean, and string types. The complex data types include arrays, maps, and structs.

Hive’s data types are listed in the table below. Note that the literals shown are those used from within HiveQL; they are not the serialized form used in the table’s storage format (see “Storage Formats”).

Table. Hive data types

The literal forms for arrays, maps, and structs are provided as functions; that is, array(), map(), and struct() are built-in Hive functions. (The struct() function is available from Hive 0.6.0; the fields it produces are named col1, col2, col3, and so on.)
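As a quick illustration, a query along the following lines builds complex-typed values inline with these functions. This is a minimal sketch: dummy stands for a hypothetical single-row table, since older Hive releases do not allow SELECT without a FROM clause.

    -- Construct an ARRAY, a MAP, and a STRUCT value using the literal functions
    -- (the STRUCT's fields are named col1, col2, col3).
    SELECT array(1, 2),
           map('a', 1, 'b', 2),
           struct('cat', 5, 2.5)
    FROM dummy;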

Primitive types

Compared to traditional databases, Hive supports only a small set of primitive data types. There is currently no support for temporal types (dates and times), although there are functions for converting Unix timestamps (stored as integers) to strings, which makes most common date operations tractable using Hive.
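For example, a sketch of this kind of conversion using Hive’s built-in from_unixtime and to_date functions (again, dummy stands for a hypothetical single-row table):

    -- Convert a Unix timestamp (seconds since the epoch, stored as an integer)
    -- to a formatted string, then extract just the date portion.
    -- The rendered value depends on the session time zone.
    SELECT from_unixtime(1262304000),
           to_date(from_unixtime(1262304000))
    FROM dummy;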

Hive’s primitive types correspond roughly to Java’s, although some names are influenced by MySQL’s type names (some of which, in turn, overlap with SQL-92). There are four signed integral types: TINYINT, SMALLINT, INT, and BIGINT, which are equivalent to Java’s byte, short, int, and long primitive types, respectively; they are 1-byte, 2-byte, 4-byte, and 8-byte signed integers.

Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float and double, which are 32-bit and 64-bit floating point numbers. Unlike some databases, there is no option to control the number of significant digits or decimal places stored for floating point values.

Hive supports a BOOLEAN type for storing true and false values.

There is a single Hive data type for storing text, STRING, which is a variable-length character string. Hive’s STRING type is like VARCHAR in other databases, although there is no declaration of the maximum number of characters to store with STRING. (The theoretical maximum size of a STRING that may be stored is 2 GB, although in practice it may be inefficient to materialize such large values. Sqoop has large object support; see “Importing Large Objects”.)
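Putting the primitive types together, a table declaration might look like the following sketch (the table and column names here are invented purely for illustration):

    -- A table using each of Hive's primitive types.
    CREATE TABLE primitives_demo (
      tiny_col    TINYINT,    -- 1-byte signed integer
      small_col   SMALLINT,   -- 2-byte signed integer
      int_col     INT,        -- 4-byte signed integer
      big_col     BIGINT,     -- 8-byte signed integer
      float_col   FLOAT,      -- 32-bit floating point
      double_col  DOUBLE,     -- 64-bit floating point
      bool_col    BOOLEAN,    -- true/false
      string_col  STRING      -- variable-length character string
    );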

Conversions

Primitive types form a hierarchy that dictates the implicit type conversions Hive will perform. For example, a TINYINT will be converted to an INT if an expression expects an INT; however, the reverse conversion will not occur, and Hive will return an error unless the CAST operator is used.

The implicit conversion rules can be summarized as follows. Any integral numeric type can be implicitly converted to a wider type. All the integral numeric types, FLOAT, and (perhaps surprisingly) STRING can be implicitly converted to DOUBLE. TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to any other type.

You can perform explicit type conversion using CAST. For example, CAST('1' AS INT) will convert the string '1' to the integer value 1. If the cast fails—as it does in CAST('X' AS INT), for example—then the expression returns NULL.
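A small sketch of both behaviors (dummy once more stands for a hypothetical single-row table):

    -- An explicit cast that succeeds, and one that fails and therefore yields NULL.
    SELECT CAST('1' AS INT),    -- the integer 1
           CAST('X' AS INT)     -- NULL, since 'X' cannot be parsed as an integer
    FROM dummy;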

Complex types

Hive has three complex types: ARRAY, MAP, and STRUCT. ARRAY and MAP are like their namesakes in Java, while a STRUCT is a record type that encapsulates a set of named fields. Complex types permit an arbitrary level of nesting. Complex type declarations must specify the type of the fields in the collection, using an angled bracket notation, as illustrated by a table definition with three columns, one for each complex type. A sketch of such a definition follows (the table, column, and field names are chosen here purely for illustration):
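    -- One column for each complex type; element and field types appear in angled brackets.
    CREATE TABLE complex (
      col1 ARRAY<INT>,
      col2 MAP<STRING, INT>,
      col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
    );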

If we load the table with one row of data for ARRAY, MAP, and STRUCT (we’ll see the file format needed to do this in “Storage Formats”), then a query like the following sketch demonstrates the field accessor operators for each type (it assumes the complex table defined above):
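    -- Array indexing is 0-based, map values are looked up by key,
    -- and struct fields are accessed with dot notation.
    SELECT col1[0],      -- first element of the ARRAY column
           col2['b'],    -- value stored under key 'b' in the MAP column
           col3.c        -- field c of the STRUCT column
    FROM complex;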

Operators and Functions

The usual set of SQL operators is provided by Hive: relational operators (such as x = 'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern matching), arithmetic operators (such as x + 1 for addition), and logical operators (such as x OR y for logical OR). The operators match those in MySQL, which deviates from SQL-92 since || is logical OR, not string concatenation. Use the concat function for the latter in both MySQL and Hive.
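For instance, a minimal sketch of string concatenation with concat (dummy again being a hypothetical single-row table):

    -- String concatenation uses the concat function, not the || operator.
    SELECT concat('Hive', 'QL')   -- 'HiveQL'
    FROM dummy;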

Hive comes with a large number of built-in functions (too many to list here), divided into categories including mathematical and statistical functions, string functions, date functions (for operating on string representations of dates), conditional functions, aggregate functions, and functions for working with XML (using the xpath function) and JSON.

You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS. To get brief usage instructions for a particular function, use the DESCRIBE command, as in the sketch below (the built-in length function is chosen arbitrarily as the example):
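    -- List all available functions, then ask for brief usage of one of them.
    SHOW FUNCTIONS;
    DESCRIBE FUNCTION length;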

If there is no built-in function that does what you want, you can write your own; see “User-Defined Functions”.

