SQOOP - The bridge between traditional and novel big data systems.
By now, we have seen articles about writing programs using the Hadoop MapReduce framework. However, all of those operations were performed on sample text files.
Here comes Sqoop to our rescue. Apache Sqoop is a tool developed to fetch and push data between Hadoop and traditional SQL-based storage systems like MySQL, PostgreSQL and MS SQL Server. Sqoop can also be used to fetch and push data from NoSQL systems. This versatility comes from Sqoop's architecture, which is built as an abstraction on top of the MapReduce framework.
Sqoop has an extension framework that makes it possible to import from, and export to, any external storage system with bulk data transfer capabilities. A Sqoop connector is a modular component that uses this framework to enable Sqoop imports and exports. Sqoop comes with connectors for a range of popular databases including MySQL, PostgreSQL, Oracle, SQL Server, DB2 and Netezza. Apart from these, Sqoop also ships with a generic connector for any database that supports the JDBC (Java Database Connectivity) protocol.
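As a quick sanity check of the connection before importing anything, Sqoop's list-tables tool can be run against the same JDBC URL; a minimal sketch, assuming a local MySQL database named <database_name> and a user who will be prompted for the password:

sqoop list-tables --connect jdbc:mysql://localhost/<database_name> --username <user> -P

If the bundled connectors do not cover a particular database, the generic JDBC connector can usually be used instead, provided the database's JDBC driver jar is placed on Sqoop's classpath.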
Consider the case where we try to import data from a MySQL table (the detailed walkthrough is here). The Sqoop command is given by:
sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --target-dir <HDFS directory>/<your_directory> \
  -m <number of map tasks>    (default: 4)
Let us analyze the above command step by step. The keyword import indicates that we are fetching data from a traditional SQL/NoSQL system into HDFS. The --connect option supplies the JDBC connection string that the underlying MapReduce job uses to reach the database. Apart from --table, which specifies the table we are importing, the interesting argument is -m, which specifies the number of map tasks used to process the import from the MySQL table into HDFS efficiently.
Internally, Sqoop's import tool runs a MapReduce job that connects to the MySQL database and reads the table. By default, Sqoop uses four map tasks in parallel to speed up the import process. We should note that each map task creates a new file in HDFS.
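For instance, after an import with the default four map tasks, listing the target directory shows one output file per map task; a sketch, assuming the data landed under the default location /user/<your_user>/<table_name>:

hadoop fs -ls /user/<your_user>/<table_name>

This would typically list part-m-00000 through part-m-00003, plus a _SUCCESS marker written by the MapReduce job.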
By default, Sqoop generates comma-separated text files for the imported data. However, we can also override the default delimiters explicitly by passing the corresponding arguments to the import command.
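For example, to write tab-separated fields with newline-terminated records instead of the defaults, the delimiter arguments can be added to the import; a minimal sketch, with <database_name> and <table_name> as placeholders:

sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n'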
The functionality:
When we fire up the import command, Sqoop uses the JDBC protocol to examine the table's columns and their SQL data types. These types are mapped to their equivalent Java data types (String, Integer, and so on), which hold the field values in the generated MapReduce code.
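If the default SQL-to-Java mapping is not what we want for a particular column, it can be overridden per column; a sketch using the --map-column-java argument, with <column_name> as a placeholder:

sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --map-column-java <column_name>=String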
Sqoop then picks a splitting column (usually the primary key of the table) and determines the range of its values. For example, if the primary key values range from 1000 to 2000 and the number of map tasks is 4, the range of 1000 values is divided so that each sub-query processes 1000/4 = 250 rows as one map task, generating one file per map task.
select * from test_table;
This query fetches all the data from the test table into HDFS. The way Sqoop handles it (assuming the default 4 map tasks and the primary key range of 1000 described above) is as follows:
select * from test_table where primary_key >= 1000 and primary_key < 1250    <----- map task 1
select * from test_table where primary_key >= 1250 and primary_key < 1500    <----- map task 2
select * from test_table where primary_key >= 1500 and primary_key < 1750    <----- map task 3
select * from test_table where primary_key >= 1750 and primary_key <= 2000   <----- map task 4
This approach gives us coarser-grained access to the table and avoids a single linear scan over the entire row set. It also lets Hadoop frameworks such as Hive, Pig and MapReduce search and apply the desired functions over smaller sets of data, providing a faster approach than traditional SQL systems. Based on the volume of data, users should choose the number of map tasks wisely: too many map tasks increase the metadata and network traffic (a scarce commodity), while too few give poor performance gains.
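When the defaults do not fit, both the splitting column and the degree of parallelism can be set explicitly; a sketch, assuming <column_name> is a numeric column with reasonably uniform values:

sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --split-by <column_name> \
  -m 8

A skewed --split-by column leads to unbalanced map tasks, so an evenly distributed numeric column (often the primary key) is the usual choice.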
A limitation of Sqoop at the time of writing this article is importing large objects such as BLOBs and CLOBs. In fact, traditional databases handle these large objects in a rather different way: they are stored in a separate storage area and are fetched and joined to the corresponding rows only when a user query requires them.
Since Sqoop runs MapReduce jobs under the hood, importing large objects inline may be cumbersome and add memory overhead. To avoid this problem, Sqoop can store the imported large objects in a separate file called a LobFile, where each record holds a reference to a single large object rather than duplicating the object itself inside the main HDFS data files. However, one drawback is that users must maintain a connection to the traditional SQL-based system when the object data actually needs to be fetched.
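Sqoop exposes a threshold for this behaviour: objects smaller than the limit are kept inline with the record, and larger ones are spilled to separate LobFile storage. A sketch, assuming we keep objects up to 16 MB inline:

sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --inline-lob-limit 16777216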
The Sqoop direct import mode is one feature we can use to avoid the above-mentioned problems of handling large BLOB and CLOB objects. Direct mode makes Sqoop fetch the data through the database's native dump utility rather than plain JDBC. The extra parameter one needs to add is --direct; the dump utility (mysqldump in the case of MySQL) must be available on the machines that run the map tasks.
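For MySQL, a direct-mode import looks almost identical to a regular one; a minimal sketch, assuming mysqldump is installed on the nodes running the map tasks:

sqoop import --connect jdbc:mysql://localhost/<database_name> \
  --table <table_name> \
  --direct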
Voila! I have imported my data into Hadoop using Sqoop and manipulated it. Now how do I put it back into SQL systems?
No worries, my friend: Sqoop provides a feature to export the data too! The semantics are almost the same, except for one important factor: type casting. While importing the data, the column types are mapped to the nearest Java data types. However, while exporting the data back into the SQL environment, this translation is not done automatically; the user needs to create the target table in the SQL system with matching column types to receive the data without type mismatch errors.
Internally, Sqoop generates a Java class that parses the HDFS records line by line based on the target table definition (metadata). A MapReduce job is then run that reads the source files from HDFS, parses the records using the generated Java class, and executes the chosen export strategy.
The JDBC-based export strategy then turns the data into bulk INSERT statements that load the data into the target table in batches.
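The export command therefore mirrors the import, pointing at the HDFS directory that holds the data and at an existing target table whose columns match; a minimal sketch with placeholder names:

sqoop export --connect jdbc:mysql://localhost/<database_name> \
  --table <target_table> \
  --export-dir <HDFS directory> \
  -m 4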
A detailed tutorial on Hive, Sqoop and HDFS, along with a hands-on sample project on the Cloudera VM, will be posted soon.
For further reading, visit the official Apache Sqoop documentation.