Working with Imported Data

Once data has been imported into HDFS, it is ready for processing by custom MapReduce programs. Text-based imports can easily be used in scripts run with Hadoop Streaming or in MapReduce jobs run with the default TextInputFormat.

To use individual fields of an imported record, though, the field delimiters (and any escape/enclosing characters) must be parsed and the field values extracted and converted to the appropriate data types. For example, the id of the “sprocket” widget is represented as the string "1" in the text file, but should be parsed into an Integer or int variable in Java. The generated table class provided by Sqoop can automate this process, allowing you to focus on the actual MapReduce job to run. Each autogenerated class has several overloaded methods named parse() that operate on the data represented as Text, CharSequence, char[], or other common types.

The MapReduce application called MaxWidgetId (available in the example code) will find the widget with the highest ID. The class can be compiled along with Widget.java; both Hadoop (hadoop-core-version.jar) and Sqoop (sqoop-version.jar) need to be on the classpath for compilation. The resulting class files can then be combined into a JAR file and executed like so:
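A sketch of that build-and-run sequence, assuming the jars live under $HADOOP_HOME and $SQOOP_HOME and the output JAR is named widgets.jar (the placeholder jar names, environment variables, and output name are assumptions; substitute the versions installed on your system):

    $ javac -classpath $HADOOP_HOME/hadoop-core-version.jar:$SQOOP_HOME/sqoop-version.jar \
        MaxWidgetId.java Widget.java
    $ jar cvf widgets.jar *.class
    $ HADOOP_CLASSPATH=$SQOOP_HOME/sqoop-version.jar hadoop jar \
        widgets.jar MaxWidgetId -libjars $SQOOP_HOME/sqoop-version.jar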

This command line ensures that Sqoop is on the classpath locally (via $HADOOP_CLASSPATH) when running the MaxWidgetId.run() method, as well as when map tasks run on the cluster (via the -libjars argument).

When run, the maxwidgets path in HDFS will contain a single output file named part-r-00000 holding the record for the widget with the highest ID.

It is worth noting that in this example MapReduce program, a Widget object was emitted from the mapper to the reducer; the auto-generated Widget class implements the Writable interface provided by Hadoop, which allows the object to be sent via Hadoop’s serialization mechanism, as well as written to and read from SequenceFiles.

The MaxWidgetId example is built on the new MapReduce API. MapReduce applications that rely on Sqoop-generated code can be built on the new or old APIs, though some advanced features (such as working with large objects) are more convenient to use in the new API.

Imported Data and Hive

As noted in the Hive chapter, for many types of analysis, using a system like Hive to handle relational operations can dramatically ease the development of the analytic pipeline.

Especially for data originally from a relational data source, using Hive makes a lot of sense. Hive and Sqoop together form a powerful toolchain for performing analysis. Suppose we had another log of data in our system, coming from a web-based widget purchasing system. This system produces log files containing a widget ID, a quantity, a shipping address, and an order date.

A record in a log of this type might look something like the following:
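(The values below, the tab delimiter, and the breakdown of the shipping address into street, city, state, and zip fields are illustrative assumptions, not data from an actual log.)

    1	15	120 Any St.	Los Angeles	CA	90210	2010-08-01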

By using Hadoop to analyze this purchase log, we can gain insight into our sales operation. By combining this data with the data extracted from our relational data source (the widgets table), we can do better. In this example session, we will compute which zip code is responsible for the most sales dollars, so we can better focus our sales team’s operations. Doing this requires data from both the sales log and the widgets table.

This sales data should be in a local file named sales.log for this to work. First, let’s load the sales data into Hive:
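A minimal sketch of that load step, assuming the illustrative tab-delimited layout above (the table name sales and its column names are assumptions rather than definitions taken from the original example):

    $ hive -e "
        CREATE TABLE sales(widget_id INT, qty INT, street STRING, city STRING,
                           state STRING, zip INT, sale_date STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
        LOAD DATA LOCAL INPATH 'sales.log' INTO TABLE sales;"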

Sqoop can generate a Hive table based on a table from an existing relational data source. Since we’ve already imported the widgets data to HDFS, we can generate the Hive table definition and then load in the HDFS-resident data:
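For instance, assuming the earlier import came from a MySQL database named hadoopguide on the local machine, used Sqoop's default comma field delimiter, and left its output in an HDFS directory named widgets (all assumptions about that earlier run), the two steps might look like:

    $ sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide \
        --table widgets --fields-terminated-by ','
    $ hive -e "LOAD DATA INPATH 'widgets' INTO TABLE widgets;"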

When creating a Hive table definition with a specific already-imported dataset in mind, we need to specify the delimiters used in that dataset. Otherwise, Sqoop will allow Hive to use its default delimiters (which are different from Sqoop’s default delimiters).

Hive’s type system is less rich than that of most SQL systems. Many SQL types do not have direct analogues in Hive. When Sqoop generates a Hive table definition for an import, it uses the best Hive type available to hold a column’s values. This may result in a decrease in precision. When this occurs, Sqoop will print a warning message alerting you to the loss of precision.

This three-step process of importing data to HDFS, creating the Hive table, and then loading the HDFS-resident data into Hive can be shortened to one step if you know that you want to import straight from the database into Hive. During an import, Sqoop can generate the Hive table definition and then load in the data. Had we not already performed the import, we could have executed this command, which re-creates the widgets table in Hive based on the copy in MySQL:
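A sketch of that one-step import, reusing the assumed MySQL connect string from above (pass the same delimiter options as before if the earlier import used non-default delimiters):

    $ sqoop import --connect jdbc:mysql://localhost/hadoopguide \
        --table widgets -m 1 --hive-import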

Running the sqoop import tool with the --hive-import argument will load the data directly from the source database into Hive; it infers a Hive schema automatically from the schema of the table in the source database. Using this, you can get started working with your data in Hive with only one command.

Regardless of which data import route we chose, we can now use the widgets data set and the sales data set together to calculate the most profitable zip code. Let’s do so, and also save the result of this query in another table for later:
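One way to phrase that query, assuming the hypothetical sales schema sketched above and id and price columns on the widgets table (the result-table name zip_profits is likewise an illustrative choice):

    $ hive -e "
        CREATE TABLE zip_profits
        AS SELECT SUM(w.price * s.qty) AS sales_vol, s.zip
           FROM sales s JOIN widgets w ON (s.widget_id = w.id)
           GROUP BY s.zip;
        SELECT * FROM zip_profits ORDER BY sales_vol DESC;"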

