Exercise 1. Configure the Hadoop environment and run the first Hadoop project, WordCount, in your IDE, in stand-alone mode.
Exercise 2. Repeat Exercise 1, but configure Hadoop in pseudo-distributed mode and run your WordCount example outside of the IDE.
Exercise 3. Configure a minimal cluster of 2-3 nodes and run WordCount there. Make sure that the tasks get distributed to different nodes. Verify this with Hadoop logging.
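For reference, the logic that WordCount computes can be sketched without Hadoop at all. The following is a minimal in-memory illustration in Python, not the Hadoop job itself; the function names `map_words` and `reduce_counts` are made up for this sketch, and splitting on whitespace is an assumption about how words are tokenized.

```python
from collections import Counter


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: add up the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)
```

Calling `word_count(["to be or", "not to be"])` gives a dictionary of per-word totals, which is a handy baseline to compare against the output of the real job on a small input.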
Exercise 4. Customer billing. Each line of your input contains the timestamp for an instance of resource utilization, then a tab, and customer-related data: customer ID, resource ID, and resource unit price. Write a MapReduce job that will create, for each customer, a summary of resource utilization by the hour, and output the result into a relational database.
Sample input format:
Wed Jan 5 11:07:00 CST 2011 (Tab) Cust89347281 Res382915 $0.0035
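One possible shape for the map and reduce steps of Exercise 4 is sketched below in plain Python, as an in-memory illustration rather than a Hadoop job. It assumes that the customer fields after the tab are separated by whitespace and that the timestamp always has six space-separated parts, as in the sample line. The names `map_usage` and `reduce_usage` are hypothetical, and writing the summary to a relational database is left out.

```python
from collections import defaultdict
from decimal import Decimal


def map_usage(line):
    """Map one input line to ((customer, resource, date, hour), unit price)."""
    timestamp, customer_data = line.rstrip("\n").split("\t")
    # Assumption: customer ID, resource ID and price are whitespace-separated.
    customer_id, resource_id, price = customer_data.split()
    # Assumption: timestamp looks like "Wed Jan 5 11:07:00 CST 2011".
    weekday, month, day, clock, _zone, year = timestamp.split()
    hour = clock.split(":")[0]
    date = f"{weekday} {month} {day} {year}"
    return (customer_id, resource_id, date, hour), Decimal(price.lstrip("$"))


def reduce_usage(pairs):
    """Group prices by key and return {key: (usage count, total cost)}."""
    grouped = defaultdict(list)
    for key, price in pairs:
        grouped[key].append(price)
    return {key: (len(prices), sum(prices)) for key, prices in grouped.items()}
```

Prices are kept as `Decimal` values rather than floats so that summing many small unit prices does not accumulate rounding error, which matters for billing.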
Exercise 5. Generate test data for Exercise 4 above. In keeping with the general Hadoop philosophy, manual testing is not enough. Write an MR task that generates an arbitrary amount of random test data from a small predefined invoice, then run your answer to Exercise 4 on that data and check that you get back the invoice you started with.
Your invoice may contain the following data:
Customer ID: Cust89347281
Date: Wed Jan 5
Resource: Res382915
Utilization
Hour  Count  Price    Cost
01    15     $0.0035  $0.0525
03    10     $0.0035  $0.0350
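The core of such a generator can be sketched as follows: for every invoice row, emit as many usage lines as the row's count, each with a random minute and second inside that hour. This is a plain Python illustration rather than an MR task. The `INVOICE` dictionary layout and the name `generate_lines` are invented for the sketch, and the year and time zone are assumptions borrowed from the sample input line in Exercise 4.

```python
import random

# Hypothetical in-memory form of the invoice shown above.
INVOICE = {
    "customer": "Cust89347281",
    "date": "Wed Jan 5",
    "year": "2011",  # assumption: taken from the Exercise 4 sample line
    "zone": "CST",   # assumption: taken from the Exercise 4 sample line
    "resource": "Res382915",
    "rows": [("01", 15, "0.0035"), ("03", 10, "0.0035")],  # hour, count, price
}


def generate_lines(invoice, rng):
    """Expand each invoice row into `count` usage lines at random times."""
    lines = []
    for hour, count, price in invoice["rows"]:
        for _ in range(count):
            minute, second = rng.randrange(60), rng.randrange(60)
            timestamp = (
                f"{invoice['date']} {hour}:{minute:02d}:{second:02d} "
                f"{invoice['zone']} {invoice['year']}"
            )
            customer_data = f"{invoice['customer']} {invoice['resource']} ${price}"
            lines.append(f"{timestamp}\t{customer_data}")
    rng.shuffle(lines)  # real input would not arrive sorted by hour
    return lines
```

Passing a seeded generator, for example `generate_lines(INVOICE, random.Random(0))`, makes a test run repeatable. Summarizing the generated lines by hour should give back the counts on the invoice, 15 and 10 here.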
Exercise 6. Deduplication. Input files often contain duplicate records. This can happen in web crawling, for example, when separate crawlers visit the same URL. Write a MapReduce task that will "dedupe" the records and output each distinct record only once.
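A common way to approach this is to use the whole record as the map output key, so that the shuffle groups identical records together and the reducer only has to emit each key once. The sketch below imitates that flow in memory with plain Python; the function names are hypothetical, and it assumes one record per line.

```python
def map_record(line):
    """Map step: the record itself becomes the key; the value is unused."""
    return line, None


def reduce_record(record, _values):
    """Reduce step: emit the key once, however many values were grouped."""
    return record


def dedupe(lines):
    """Simulate map, shuffle (group by key, sorted) and reduce in memory."""
    groups = {}
    for line in lines:
        key, value = map_record(line)
        groups.setdefault(key, []).append(value)
    return [reduce_record(key, values) for key, values in sorted(groups.items())]
```

The output comes back sorted by record because the simulated shuffle sorts the keys, much as a MapReduce shuffle delivers keys to each reducer in sorted order.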
Exercise 7. Write a distributed grep.
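As a starting point, a distributed grep can be thought of as a map-only filter: each mapper scans its own input split and emits the lines that match the pattern, and the reduce step, if any, just passes them through. The following plain Python sketch treats each list of lines as one split; the names `grep_map` and `distributed_grep` are made up for illustration, and the pattern is assumed to be a regular expression.

```python
import re


def grep_map(lines, pattern):
    """Map step for one split: yield (line index, line) for matching lines."""
    regex = re.compile(pattern)
    for index, line in enumerate(lines):
        if regex.search(line):
            yield index, line


def distributed_grep(splits, pattern):
    """Run the map step over every split and collect the matching lines."""
    matches = []
    for split in splits:
        matches.extend(line for _index, line in grep_map(split, pattern))
    return matches
```

For example, `distributed_grep([["error: disk", "ok"], ["ok", "error: net"]], r"^error")` collects the two lines that begin with "error", one from each split.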
Art: Carl Larsson - Esbjorn Doing His Homework