Friday, January 14, 2011

Loading data from MySQL to Hadoop

Suppose you need to load your data from MySQL to Hadoop. But why might you want that?

A typical use case might be when you have an ever-growing MySQL database, which comes from users of a web site, and you require significant amount of computations on this data. You might be running a dating site, and you may want to compare the match of each of (millions) of the candidates to each of the other ones. Or you may be running an auction web site, where you may need to try to match all bids with all offers.

There are basically two approaches you can take.

  1. Import the data into Hadoop on a regular basis, then run the computations. This can be done with Sqoop, HIHO, custom file formats on top of Hadoop API, Cascading, cascading-dbmigrate. You could dump the data into text format files for Hadoop into HDFS. Then parallelize the computations.
  2. Change the algorithm, seeing that a brute-force compare grows as a square of the number of users. You may try, for example, to partition the data, sort it before matching, etc.
You may end up with the combination of the two. I want to try these approaches and post the results (although I am specifically interested in the load part).

Why would you want to load complete data every time? Because your users updates it. But you might also want to select only rows where the information was updated, and match that, keeping the unchanged results in some other storage.

It would also be interesting to see if Pregel (Hama) can help.

(This post was inspired by a discussion on the core-user mailing list - my gratitude to the person who asked the question and to the talented group who answered).

Art: Claude Oscar Monet - Unloading Coal

2 comments:

  1. Hi, I am JB from India. I am a starter in Hadoop. I would like to give you a brief about the job I am doing now. I have a table containing 10,00000 (10 lacs rows) which is in HDFS (input) and I need to compare each of the row with another having 90,00000 (90 lacs) rows. I can not take the 2nd table as input. Need your help. -- Regards,
    JB (mailtojkbora@gmail.com)

    ReplyDelete
  2. I would like to help if I can - but I need more detail. Can you expand a little?

    Thank you,
    Mark

    ReplyDelete