A Sample Import

After you install Sqoop, you can use it to import data to Hadoop. Sqoop imports from databases; if you don’t already have a database server installed, you’ll need to choose one. MySQL is an easy-to-use database available for a large number of platforms.

To install and configure MySQL, follow the documentation at http://dev.mysql.com/doc/refman/5.1/en/. The chapter “Installing and Upgrading MySQL” in particular should help. Users of Debian-based Linux systems (e.g., Ubuntu) can type sudo apt-get install mysql-client mysql-server. Red Hat users can type sudo yum install mysql mysql-server.

Now that MySQL is installed, let’s log in and create a database (see the example below).

Example: Creating a new MySQL database schema
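
A session along the following lines creates the schema and grants access; treat it as a sketch rather than an exact transcript, since GRANT syntax varies between MySQL versions (granting to the anonymous user ''@'localhost' is what opens the schema to any local user):

% mysql -u root -p
Enter password:
mysql> CREATE DATABASE hadoopguide;
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
mysql> quit;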

The password prompt above asks for your root user password. This is likely the same as the password for the root shell login. If you are running Ubuntu or another variant of Linux where root cannot directly log in, then enter the password you picked at MySQL installation time.

In this session, we created a new database schema called hadoopguide, which we’ll use throughout this appendix. We then allowed any local user to view and modify the contents of the hadoopguide schema, and closed our session.*

* Of course, in a production deployment, we’d need to be much more careful about access control, but this serves for demonstration purposes. The above privilege grant also assumes you’re running a pseudodistributed Hadoop instance. If you’re working with a distributed Hadoop cluster, you’d need to enable remote access by at least one user, whose account will be used to perform imports and exports via Sqoop.
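
In that distributed case, a grant along the following lines would do it; the account name sqoopuser and its password are placeholders to replace with your own:

mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO 'sqoopuser'@'%'
    -> IDENTIFIED BY 'password';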

Now let’s log back into the database (not as root, but as yourself this time), and create a table to import into HDFS (see the example below).

Example: Populating the database
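
The column names, types, and rows below are illustrative assumptions rather than a canonical listing; any table with a handful of columns of mixed types and three rows will reproduce the behavior described in the rest of this section:

% mysql hadoopguide
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -> widget_name VARCHAR(64) NOT NULL,
    -> price DECIMAL(10,2),
    -> design_date DATE,
    -> version INT,
    -> design_comment VARCHAR(100));
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',
    -> 1, 'Connects two gizmos');
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4, NULL);
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13', 13,
    -> 'Our flagship product');
mysql> quit;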

In the above listing, we created a new table called widgets. We’ll be using this fictional product database in further examples in this chapter. The widgets table contains several fields representing a variety of data types.

Now let’s use Sqoop to import this table into HDFS:
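
The command looks something like this; the connect string and the -m flag are explained below, and the other options are standard Sqoop import arguments:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1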

Sqoop’s import tool will run a MapReduce job that connects to the MySQL database and reads the table. By default, this will use four map tasks in parallel to speed up the import process. Each task will write its imported results to a different file, but all in a common directory. Since we knew that we had only three rows to import in this example, we specified that Sqoop should use a single map task (-m 1) so we get a single file in HDFS.

We can inspect this file’s contents like so:
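
Sqoop writes its output to a directory named after the table in your HDFS home directory, using the usual part-m-NNNNN file names, so with a single map task the inspection is roughly the following (the rows shown assume the illustrative data inserted earlier):

% hadoop fs -cat widgets/part-m-00000
1,sprocket,0.25,2010-02-10,1,Connects two gizmos
2,gizmo,4.00,2009-11-30,4,null
3,gadget,99.99,1983-08-13,13,Our flagship product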

The connect string (jdbc:mysql://localhost/hadoopguide) shown in the example will read from a database on the local machine. If a distributed Hadoop cluster is being used, then localhost should not be specified in the connect string; map tasks not running on the same machine as the database will fail to connect. Even if Sqoop is run from the same host as the database server, the full hostname should be specified.
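
On a cluster, a connect string with a full hostname looks like the following, where db.example.com stands in for your actual database host:

% sqoop import --connect jdbc:mysql://db.example.com/hadoopguide \
> --table widgets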

By default, Sqoop will generate comma-delimited text files for our imported data. Delimiters can be specified explicitly, as can field-enclosing and escape characters, to allow the presence of delimiters in the field contents. The command-line arguments that specify delimiter characters, file formats, compression, and more fine-grained control of the import process are described in the Sqoop User Guide distributed with Sqoop,† as well as in the online help (sqoop help import, or man sqoop-import in CDH).
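
As a sketch, an import using tab-separated fields, backslash escaping, and double-quote enclosing might be written as follows; all of these are standard import options, and sqoop help import lists the rest:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 \
> --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'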

Text and binary file formats

Sqoop is capable of importing into a few different file formats. Text files (the default) offer a human-readable representation of data, platform independence, and the simplest structure. However, they cannot hold binary fields (such as database columns of type VARBINARY) and cannot distinguish between null values and String-based fields containing the value "null". To handle these conditions, you should use Sqoop’s SequenceFile-based format. The disadvantages of SequenceFiles are that they are Java-specific and that current versions of Sqoop cannot load them into Hive. But SequenceFiles provide the most precise representation of the imported data possible. SequenceFiles also allow data to be compressed while retaining MapReduce’s ability to process different sections of the same file in parallel.
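
For instance, an import into compressed SequenceFiles could be requested along these lines; --as-sequencefile and --compress are standard import options, and --target-dir (here the hypothetical widgets-seq) simply keeps the output away from the earlier text import:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 \
> --as-sequencefile --compress \
> --target-dir widgets-seq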

† Available at http://archive.cloudera.com/cdh/3/sqoop/.

