Sunday, January 30, 2011

Ruminations on the use of Hadoop


The MapReduce framework is used at Google, Yahoo, Facebook, biotech companies, news agencies, and in marketing. But why?

So how do you like this style?

Inspired by Ikai Lan

Here is how I see it: after Ikai Lan's blog post "App Engine datastore tip: in a high write rate application, monotonically increasing index values are bad," people said, "Dude, I don't understand what you are talking about, but I sure like the cartoons." Overnight, Ikai became the most famous Google cartoonist, and therefore the most famous cartoonist ever.

I can't even reach the master's ankles, but nevertheless I started drawing, and here are my first two cartoons on the use of Hadoop for perfect matching.



Now I have a few things I need to work on:

But you can see that I am excited about cartoon-style illustrations.

For all the rest,
Let Lion, Moonshine, Wall, and lovers twain
At large discourse, while here they do remain.

Friday, January 14, 2011

Loading data from MySQL to Hadoop

Suppose you need to load your data from MySQL to Hadoop. But why might you want that?

A typical use case is an ever-growing MySQL database, fed by the users of a web site, on which you need to run a significant amount of computation. You might be running a dating site and want to score the match of each of the (millions of) candidates against every other one. Or you might be running an auction web site, where you need to try to match all bids with all offers.

There are basically two approaches you can take.

  1. Import the data into Hadoop on a regular basis, then run the computations. This can be done with Sqoop, HIHO, custom file formats on top of the Hadoop API, Cascading, or cascading-dbmigrate. You could also dump the data into text files in HDFS. Then parallelize the computations.
  2. Change the algorithm, seeing that a brute-force comparison grows as the square of the number of users. You may try, for example, to partition the data, sort it before matching, etc.
You may end up with a combination of the two. I want to try these approaches and post the results (although I am specifically interested in the load part).
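
To make the second approach concrete, here is a minimal sketch in plain Python (not Hadoop code) that counts how many comparisons a brute-force match needs versus a partitioned one. The user records and the choice of city as the partition key are hypothetical, and the sketch assumes that only users in the same partition can match, which trades completeness for speed.

```python
from collections import defaultdict
from itertools import combinations


def brute_force_pairs(users):
    """Return every unordered pair: n * (n - 1) / 2 comparisons."""
    return list(combinations(users, 2))


def partitioned_pairs(users, key):
    """Return only the pairs whose members share a partition key."""
    buckets = defaultdict(list)
    for user in users:
        buckets[key(user)].append(user)
    pairs = []
    for bucket in buckets.values():
        pairs.extend(combinations(bucket, 2))
    return pairs


# Hypothetical users as (id, city); partitioning by city is an assumption.
users = [
    (1, "Austin"), (2, "Austin"), (3, "Boston"),
    (4, "Boston"), (5, "Austin"), (6, "Denver"),
]

all_pairs = brute_force_pairs(users)
same_city_pairs = partitioned_pairs(users, key=lambda user: user[1])
print(len(all_pairs), len(same_city_pairs))
```

In a MapReduce job the partition key would typically become the map output key, so that each reducer only compares the users of one partition.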

Why would you want to load the complete data every time? Because your users update it. But you might also want to select only the rows where the information was updated, and match those, keeping the unchanged results in some other storage.
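
Selecting only the updated rows could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the `profiles` table, its `updated_at` column, and the stored time of the last load are all assumptions made for illustration.

```python
import sqlite3


def fetch_changed_rows(conn, last_load_time):
    """Return the rows modified after the previous load (the 'watermark')."""
    cursor = conn.execute(
        "SELECT id, name, updated_at FROM profiles "
        "WHERE updated_at > ? ORDER BY id",
        (last_load_time,),
    )
    return cursor.fetchall()


# Hypothetical table; sqlite3 stands in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?)",
    [
        (1, "alice", "2011-01-10 09:00:00"),
        (2, "bob", "2011-01-13 17:30:00"),
        (3, "carol", "2011-01-14 08:15:00"),
    ],
)

# Only rows touched since the last load need to be exported and re-matched.
changed = fetch_changed_rows(conn, "2011-01-12 00:00:00")
print(changed)
```

This only works if the application reliably maintains a last-modified column; deleted rows would need separate handling.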

It would also be interesting to see if Pregel (Hama) can help.

(This post was inspired by a discussion on the core-user mailing list - my gratitude to the person who asked the question and to the talented group who answered).

Art: Claude Oscar Monet - Unloading Coal

Wednesday, January 12, 2011

How to learn HBase

The basic answer is

  • Read the BigTable paper;
  • Learn the HBase architecture;
  • Read the forum messages and absorb what is there.
Here is one place where this is well said in more detail, and besides, one of my HBase expert friends gave the same answer.

I am beginning the fourth chapter of the book, where I want to offer programmers and other software developers :) an approach that teaches them what they need to know in a simple, clear way, and also gives them the best design practices. When I have that, I will give a short outline, and post the exercises.

Art: Vincent Van Gogh - Bridge at Arles

Tuesday, January 4, 2011

Exercises for chapter 1 - how do they feel?

Exercise 1. Configure the Hadoop environment and run the first Hadoop project, WordCount, in your IDE, in stand-alone mode.

Exercise 2. Do the same task as in Exercise 1, only configure Hadoop in pseudo-distributed mode, and run your WordCount example outside of the IDE.

Exercise 3. Configure a minimal cluster of 2-3 nodes and run WordCount there. Make sure that the tasks get distributed to different nodes. Verify this with Hadoop logging.

Exercise 4. Customer billing. Each line of your input contains the timestamp for an instance of resource utilization, then a tab, and customer-related data: customer ID, resource ID, and resource unit price. Write a MapReduce job that will create, for each customer, a summary of resource utilization by the hour, and output the result into a relational database.

Sample input format:

Wed Jan 5 11:07:00 CST 2011 (Tab) Cust89347281 Res382915 $0.0035
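
As a possible starting point (not the full answer), the sketch below parses the sample line and aggregates by customer, resource, date, and hour in plain Python. The function names are hypothetical, the map and reduce steps are only simulated in memory rather than written against the Hadoop API, and the database output is left out.

```python
from collections import defaultdict
from decimal import Decimal


def map_line(line):
    """Turn one input line into ((customer, resource, date, hour), price)."""
    timestamp, customer_data = line.rstrip("\n").split("\t")
    _weekday, month, day, clock, _tz, year = timestamp.split()
    customer_id, resource_id, price = customer_data.split()
    hour = clock.split(":")[0]
    key = (customer_id, resource_id, f"{year} {month} {day}", hour)
    return key, Decimal(price.lstrip("$"))


def reduce_usage(pairs):
    """Sum the utilization count and cost for every key."""
    summary = defaultdict(lambda: [0, Decimal("0")])
    for key, price in pairs:
        summary[key][0] += 1
        summary[key][1] += price
    return {key: tuple(totals) for key, totals in summary.items()}


lines = [
    "Wed Jan 5 11:07:00 CST 2011\tCust89347281 Res382915 $0.0035",
    "Wed Jan 5 11:42:10 CST 2011\tCust89347281 Res382915 $0.0035",
    "Wed Jan 5 12:01:30 CST 2011\tCust89347281 Res382915 $0.0035",
]
hourly = reduce_usage(map_line(line) for line in lines)
for key, (count, cost) in sorted(hourly.items()):
    print(key, count, cost)
```

Decimal is used instead of float so that the small unit prices add up exactly.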

Exercise 5. Generate test data for Exercise 4 above. In keeping with the general Hadoop philosophy, manual testing is not enough. Write an MR task to generate an arbitrary amount of random test data from a small predefined invoice, then run your answer to Exercise 4 and see if you get the results you started out with.

Your invoice may contain the following data:

Customer ID: Cust89347281
Date: Wed Jan 5
Resource: Res382915

Utilization

Hour    Count    Price      Cost
01      15       $0.0035    $0.0525
03      10       $0.0035    $0.0350

Exercise 6. Deduplication. Input files often contain identical records. This may happen in web crawling, when individual crawlers come to the same URL. Write a MapReduce task that will "dedupe" the records and output each distinct record only once.
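
One common way to think about this is to make the whole record the map output key, so that the shuffle brings duplicates together and the reducer emits each key once. The sketch below imitates that flow in plain Python; the function names and sample URLs are hypothetical, and sorting stands in for the Hadoop shuffle.

```python
from itertools import groupby


def map_record(line):
    """Emit the record itself as the key; the value carries no information."""
    return line.strip(), 1


def dedupe(lines):
    """Group identical keys together and emit each key exactly once."""
    mapped = sorted(map_record(line) for line in lines)  # stands in for shuffle/sort
    return [key for key, _group in groupby(mapped, key=lambda pair: pair[0])]


crawled = [
    "http://a.example/\n",
    "http://b.example/\n",
    "http://a.example/\n",
]
unique_urls = dedupe(crawled)
print(unique_urls)
```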

Exercise 7. Write a distributed grep.
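
A rough sketch of the idea, again in plain Python with hypothetical names: each mapper scans its own split and emits only the matching lines, and the reduce step has nothing to do beyond collecting them. Line numbers are a simplification here; with Hadoop's TextInputFormat the mapper receives byte offsets instead.

```python
import re


def grep_map(filename, lines, pattern):
    """Emit ((filename, line number), line) for every line that matches."""
    regex = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield (filename, number), line.rstrip("\n")


def distributed_grep(splits, pattern):
    """Run the map step over each split and collect the sorted matches."""
    matches = []
    for filename, lines in splits.items():
        matches.extend(grep_map(filename, lines, pattern))
    return sorted(matches)


# Hypothetical input splits, one per file.
splits = {
    "a.log": ["error: disk\n", "ok\n"],
    "b.log": ["ok\n", "fatal error\n"],
}
found = distributed_grep(splits, r"error")
print(found)
```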

Art: Carl Larsson - Esbjorn Doing His Homework

Monday, January 3, 2011

The Best Way to Help People with Exercises?

At the end of each chapter, the readers of "Hadoop in Practice" will find exercises. Practice makes perfect, and doing exercises is invaluable, but how do you prompt the readers to do them?

  1. Have a forum where the students can find help;
  2. Allow readers to submit their answers?
  3. Or maybe not, since the exercise is correct if it works?
  4. Any other suggestions?

Art: Thomas Rowlandson - Private practice previous to the ball, from Scenes at Bath