A typical use case: you have an ever-growing MySQL database, fed by the users of a web site, and you need to run a significant amount of computation on that data. You might be running a dating site and want to compare each of millions of candidates against all the others. Or you might be running an auction site, where you need to match all bids against all offers.
There are basically two approaches you can take.
- Import the data into Hadoop on a regular basis, then run the computations there. This can be done with Sqoop, HIHO, custom file formats on top of the Hadoop API, Cascading, or cascading-dbmigrate. Alternatively, you can dump the data into text files and load them into HDFS. Then parallelize the computations.
- Change the algorithm, since a brute-force comparison grows as the square of the number of users. You might, for example, partition the data, sort it before matching, and so on.
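The partition-and-sort idea in the second approach can be sketched as blocking: group records by a coarse key and only compare candidates within the same bucket, instead of all pairs. This is a minimal illustration in Python; the user fields and the choice of blocking key are hypothetical, not taken from any particular site:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical user records: (user_id, region, age) -- illustrative fields only.
users = [
    (1, "NY", 25), (2, "NY", 26), (3, "SF", 31),
    (4, "SF", 30), (5, "NY", 41), (6, "SF", 32),
]

def bucket_key(user):
    # Block on region and a coarse age band; only users sharing a
    # bucket are ever compared, instead of all O(n^2) pairs.
    _, region, age = user
    return (region, age // 10)

buckets = defaultdict(list)
for u in users:
    buckets[bucket_key(u)].append(u)

# Candidate pairs to score: here 4 pairs instead of the full 15.
pairs = [p for group in buckets.values() for p in combinations(group, 2)]
```

The trade-off is recall: a good match that straddles two buckets is never considered, so the blocking key has to be chosen (or overlapped) carefully.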
Why would you want to load the complete data set every time? Because your users update it. But you could also select only the rows where information has changed, match just those, and keep the unchanged results in some other storage.
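The incremental selection amounts to filtering on a last-modified column. Here is a sketch using Python's sqlite3 as a stand-in for MySQL; the `users` table, its columns, and the timestamps are all made up for illustration:

```python
import sqlite3

# Stand-in for the MySQL database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, profile TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a", 100), (2, "b", 205), (3, "c", 310)],
)

def changed_since(conn, last_run):
    # Export only rows touched since the previous batch run; unchanged rows
    # keep their previously computed match results in other storage.
    cur = conn.execute(
        "SELECT id, profile FROM users WHERE updated_at > ?", (last_run,)
    )
    return cur.fetchall()

rows = changed_since(conn, 200)  # only the rows updated after the last run
```

Sqoop supports the same pattern natively with its incremental import mode, so the filtering can happen at import time rather than in application code.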
It would also be interesting to see whether Pregel-style graph processing (for example, Apache Hama) can help.
(This post was inspired by a discussion on the core-user mailing list - my gratitude to the person who asked the question and to the talented group who answered).
Art: Claude Oscar Monet - Unloading Coal