Migrating data from a SQL database to Hadoop
If you're in the process of Hadoopifying your back-end, you most likely have a lot of data that you need migrated from your SQL databases to Hadoop. There are two notable existing solutions out there for doing this: Cloudera's Sqoop and cascading.jdbc. We ran into issues with both of these packages, so we made our own solution called cascading-dbmigrate which we open-sourced today.
Let's go through each solution:
Sqoop
Sqoop works by looking at your database table, generating a class to represent a single row from the table, doing some magic to determine the most efficient way to split up the table into chunks for the MapReduce job, and then migrating to a directory on HDFS according to options you specify.
The first problem we had with Sqoop was that it just plain didn't work. Sqoop doesn't detect unsigned columns and will try to fit an unsigned integer into a java int which causes errors. I had to modify and rebuild Sqoop to get it to use larger datatypes.
After fixing these bugs, Sqoop dumped our database table in a text file format. Each line represented a row and each column was separated by a comma. This dump was broken though since Sqoop wasn't handling commas or newlines in our database in any special way. So our database dump was essentially unusable.
We next tried the --as-sequencefile option of Sqoop, which dumps the data into SequenceFiles with each row of data being an object of the class Sqoop generated. This works to get the data into HDFS, but consuming the data is then a pain. First, when you want to consume those SequenceFile's, you need to copy the generated code into your repository. Since the generated code depends on the Sqoop code, you need to include the Sqoop jar as a dependency in your build. Second, we wanted to run some custom logic to transform our database records into a graph-based schema so this required yet another job to be run.
Sqoop just requires way too much work to get your data migrated.
cascading.jdbc
cascading.jdbc provides a custom tap to allow you to integrate data from databases into your flows like any other source. Getting up and running with cascading.jdbc wasn't too hard, but we ran into really bad performance issues. cascading.jdbc uses LIMIT and OFFSET in its SQL queries which have terrible performance. Our flows literally wouldn't run on any table of reasonable size because the queries would take too long.
cascading-dbmigrate
We like the approach of cascading.jdbc in that we can both read the data from the database and run our custom migration logic in one job. To relieve the performance issues, we wrote a new tap that uses range queries over the primary key column of a table to read in data for each task. Check out the project for more details.
cascading-dbmigrate works for any table that has a primary key column that is either an int or a long. We find this to be the most common case and it holds of all the tables we are migrating at BackType. If this is not the case for you, cascading-dbmigrate won't be useful.
You can follow me on Twitter at @nathanmarz.