
Let us 'Sqoop' it!

SQOOP - The bridge between traditional and novel big data systems.

By now, we have seen articles about writing programs using the Hadoop MapReduce framework. However, all of those operations were performed on sample text files.

Here comes Sqoop to our rescue. Apache Sqoop is a tool developed to fetch/put data from traditional SQL-based data storage systems such as MySQL, PostgreSQL and MS SQL Server. Sqoop can also be used to fetch/push data from NoSQL systems. This versatility comes from Sqoop's architecture, which is abstracted on top of the MapReduce framework.

Sqoop has an extension framework that makes it possible to import data from, and export data to, any external storage system with bulk data transfer capabilities. A Sqoop connector is a modular component that uses this framework to enable Sqoop imports and exports. Sqoop comes with connectors for a range of popular databases, including MySQL, PostgreSQL, Oracle, SQL Server, DB2 and Netezza. Apart from these, Sqoop also supports a generic connector for any database that speaks the JDBC (Java Database Connectivity) protocol.

Consider the case where we try to import data from a MySQL table (the detailed walk-through is here). The Sqoop command is given by

 
sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --target-dir <HDFS_directory>/<your_directory> \
  -m <number of map tasks>    (default: 4)

Let us analyze the above command step by step. The term import indicates that we are fetching data from a traditional SQL/NoSQL system into HDFS. The --connect option takes a JDBC connection string, which tells Sqoop which protocol and database to connect to for the underlying MapReduce operations, while --target-dir names the HDFS directory where the imported files will land. Apart from --table, which specifies the table we are importing, the interesting argument is -m, which specifies the number of map tasks Sqoop should use to process the import from the MySQL table into HDFS efficiently.
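As a concrete illustration, an import might look like the following (the database, user, table and directory names here are hypothetical; --username and -P simply supply the database credentials, with -P prompting for the password):

sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --username retail_user -P \
  --table ITEM_SALES_RECORD \
  --target-dir /user/cloudera/item_sales \
  -m 4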

Internally, Sqoop runs a MapReduce job that connects to the MySQL database and reads the table. By default, Sqoop uses four map tasks in parallel to speed up the import process. We should note that each map task creates its own output file in HDFS.
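For example, listing the target directory after a default four-mapper import (using the hypothetical path from above) typically shows one part file per map task plus a _SUCCESS marker; the listing below is abbreviated:

hadoop fs -ls /user/cloudera/item_sales
  /user/cloudera/item_sales/_SUCCESS
  /user/cloudera/item_sales/part-m-00000
  /user/cloudera/item_sales/part-m-00001
  /user/cloudera/item_sales/part-m-00002
  /user/cloudera/item_sales/part-m-00003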

By default, Sqoop generates comma-separated text files for the imported data. However, we can override the default delimiters explicitly by passing extra arguments to the import command.
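For instance, to produce tab-separated output instead, we could pass the delimiter options explicitly (a sketch reusing the hypothetical names from above):

sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --table ITEM_SALES_RECORD \
  --target-dir /user/cloudera/item_sales_tsv \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n' \
  -m 4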

The functionality:
The Sqoop command talks to the database over JDBC, and once we fire up the import command, the tool inspects the table's columns and their SQL data types. These types are mapped to their equivalent Java data types (String, Integer, and so on) for the fields used in the MapReduce operations.

Sqoop picks a splitting column (usually the primary key of the data) and identifies the range/span of its values. For example, if the primary key values range from 1000 to 2000 and the number of map tasks is 4, then each map task runs a sub-query that processes (2000 - 1000)/4 = 250 records/rows and generates one output file.

select * from test_table;
This query fetches all the data from the test table into HDFS. The way Sqoop handles it (assuming the default 4 map tasks and primary key values ranging from 1000 to 2000) is as follows:
select * from test_table where primary_key >= 1000 and primary_key < 1250    <----- map task 1
select * from test_table where primary_key >= 1250 and primary_key < 1500    <----- map task 2
select * from test_table where primary_key >= 1500 and primary_key < 1750    <----- map task 3
select * from test_table where primary_key >= 1750 and primary_key <= 2000   <----- map task 4
This approach gives us chunked, parallel access to the table's rows instead of a single linear scan over the entire row set. It helps Hadoop frameworks such as Hive, Pig and MapReduce search and apply the desired functions over smaller sets of data, which is often faster than doing the same work in a traditional SQL system. Based on the volume of data, users should choose the number of map tasks wisely: too many map tasks increase metadata and network traffic (a scarce commodity), while too few leave performance gains on the table.
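If the table lacks a usable primary key, or we want to control the parallelism ourselves, we can name the splitting column and the mapper count explicitly. A minimal sketch, assuming the hypothetical database and table names used earlier:

sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --table test_table \
  --split-by primary_key \
  -m 8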
One limitation of Sqoop at the time of writing this article concerns importing large objects such as BLOBs and CLOBs. Traditional databases handle these large objects quite differently: they are stored in a separate area of storage and are fetched and joined to the corresponding rows as required by the user's query.
Since Sqoop runs MapReduce jobs under the hood, importing large objects into the system can be cumbersome and add memory overhead. To avoid this problem, Sqoop can store imported large objects in a separate file called a LobFile, where each record holds a single large object; the main imported record then keeps only a reference into the LobFile instead of duplicating the object inline. However, one drawback for users is the need to maintain a connection to the traditional SQL-based system for fetching the data.
The Sqoop direct import mode is one feature we can use to ease the above-mentioned problems of handling large BLOB and CLOB objects. In direct mode, Sqoop bypasses JDBC and delegates the transfer to the database's native dump utility (mysqldump in the case of MySQL) running on the server. The extra parameter one needs to add is the --direct flag.
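A minimal sketch of a direct-mode import, assuming the hypothetical MySQL database from earlier (direct mode is only available for databases whose native tools Sqoop knows how to drive, such as MySQL and PostgreSQL):

sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --table ITEM_SALES_RECORD \
  --direct \
  -m 4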

Voila! I have imported my data into Hadoop using Sqoop and manipulated it. Now how do I put it back into SQL systems?

No worries, my friend: Sqoop provides a feature to export the data too! The semantics are almost the same, except for the important matter of type casting. While importing the data, the column types are mapped to the nearest Java data types; while exporting the data back into the SQL environment, however, this translation is not reversed automatically. The user needs to define the target table in the SQL system with matching column types to receive the data without type mismatch errors.
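A minimal sketch of an export, assuming the exported records sit under a hypothetical HDFS directory and a target table ITEM_SALES_SUMMARY has already been created in MySQL with appropriate column types:

sqoop export \
  --connect jdbc:mysql://localhost/salesdb \
  --table ITEM_SALES_SUMMARY \
  --export-dir /user/cloudera/item_sales_output \
  --input-fields-terminated-by ',' \
  -m 4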

Internally, Sqoop generates a Java class that parses the HDFS records line by line based on the target table definition (its metadata). A MapReduce job is then run which reads the source files from HDFS, parses the records using the generated Java class, and executes the chosen export strategy against the database.

The JDBC-based export strategy then turns the parsed records into bulk INSERT statements that load the data into the target table in batches.
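Sqoop also exposes a --batch flag for exports, which asks the underlying JDBC driver to execute the generated INSERT statements in batch mode; a sketch reusing the hypothetical names above:

sqoop export \
  --connect jdbc:mysql://localhost/salesdb \
  --table ITEM_SALES_SUMMARY \
  --export-dir /user/cloudera/item_sales_output \
  --batch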

A detailed tutorial on Hive, Sqoop and HDFS, along with a hands-on sample project on the Cloudera VM, will be posted soon.

For further reading, visit the official Apache Sqoop documentation.
 

