Big Data Crash Course: Using Sqoop For Ingestion

Let’s get started with Sqoop – it’s the simplest component of the Hadoop ecosystem and can be described very quickly. Sqoop is a tool for large-scale, bulk data transfer between structured data sources (such as relational databases) and HDFS or HBase.

NOTE: HDFS stands for Hadoop Distributed File System. It’s the underlying storage component of the Hadoop ecosystem. We discuss HDFS in further detail in a later section of this book.

We might use Sqoop to maintain an active archive of our on-premises data, to move data for ad-hoc processing, or to ingest data from our existing structured data sources so it can be joined or enriched with other data to provide additional business insight and value.

Sample data sources include: MySQL, PostgreSQL, Microsoft SQL Server, Oracle and Teradata.
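All of these are reached over JDBC. As a rough illustration (the host names, ports and database names below are placeholders, not values from this example), the connection strings look like:

  jdbc:mysql://db-host:3306/sales_db
  jdbc:postgresql://db-host:5432/sales_db
  jdbc:sqlserver://db-host:1433;databaseName=sales_db
  jdbc:oracle:thin:@db-host:1521:ORCL
  jdbc:teradata://db-host/DATABASE=sales_db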

As it’s a simple process and manageable through the Hue user interface, it makes sense to run through an example as part of this course. More complex services that require Linux command-line or scripting knowledge are outside the scope of this book.

NOTE: Hue is the user interface we use to interact with various Hadoop services.

So, let’s run through the steps you’ll need to take to set up a Sqoop job. We’re going to use Hue to do this, giving us a visual user interface rather than the command line.

From Hue, hit the ‘Sqoop Transfer’ option under the ‘Data Browsers’ drop-down menu. You’ll see the below screen:

Click on the link ‘click here to add one’. You’ll be presented with the below screen. Note that there are no options in the drop-down menus.

You’ll need to click on ‘add new link’ to add a new connection. You’ll be presented with the below screen. Go ahead and give your new connection a name. For this example, we’re going to use the ‘generic-jdbc-connector’ option.

There are lots of JDBC connector options you could choose from. The most common are PostgreSQL, MySQL and Microsoft SQL Server.
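To give a feel for what the link form expects, here’s a sketch of the values for a MySQL connection. The host, database, username and driver class here are illustrative only (and the MySQL driver JAR needs to be available to Sqoop already):

  Name:                    mysql_link
  JDBC Driver Class:       com.mysql.jdbc.Driver
  JDBC Connection String:  jdbc:mysql://db-host:3306/test_db
  Username:                db_user
  Password:                ********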

Once we’ve got our connection set up, the interface will take us through the remaining steps to create the transfer job. You’ll need to define:

Type:              Import
Connection:        MySQL
Table Name:        Test Table
Storage Type:      HDFS
Output File Type:  Text
Output Directory:  /directory_name/
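For reference, the same import can also be expressed on the Sqoop command line (Sqoop 1 syntax, which differs from the Sqoop 2 transfers Hue drives). This is a minimal sketch using the same illustrative names as above – test_db on db-host and user db_user; the -P flag prompts for the password:

  sqoop import \
    --connect jdbc:mysql://db-host:3306/test_db \
    --username db_user -P \
    --table test_table \
    --as-textfile \
    --target-dir /directory_name/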

So there we have it. Sqoop enables us to do bulk transfers from our structured data store (in this case MySQL) to HDFS, and from HDFS back out to structured data stores. It’s super easy to configure (especially through the Hue interface) and is an extremely powerful tool in our Hadoop toolbox.
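Going the other direction works much the same way. As a hedged sketch of the equivalent command-line export, again with the same illustrative names:

  sqoop export \
    --connect jdbc:mysql://db-host:3306/test_db \
    --username db_user -P \
    --table test_table \
    --export-dir /directory_name/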