====== Lab 05: MapReduce in the Cloud ======

//By: Laureline David & Michael Rohrer//

==== Pedagogical Objectives ====

  * Perform data analysis in the cloud using a dynamically allocated cluster of machines
  * Write a MapReduce program
  * Become familiar with Hadoop

==== Tasks ====

In this lab you will perform a number of tasks and document your progress in a lab report. Each task specifies one or more deliverables to be produced. Collect all the deliverables in your lab report. Give the lab report a structure that mimics the structure of this document.

===== Task 1 - Using Elastic MapReduce =====

==== Copy the lines that follow, which contain several tens of counters, into the lab report. ====

No such lines were found in the syslog (provided as an annex) or in the stderr files. This is probably due to the Hadoop distribution used, which is Amazon 1.0.3.

==== Copy a screenshot of the EMR console into the report. ====

{{ :heig:cld:lab05_c1.png?nolink |}}

{{ :heig:cld:lab05_s1.png?nolink |}}

==== Copy the bar chart of maximum temperature by year into the report. ====

{{ :heig:cld:lab05graph1.png?nolink |}}

==== What is the overall highest temperature in the data set? ====

The overall highest temperature is 38.0 degrees. It was reached in 2003.

==== How many EC2 instances were created to run the job? ====

Three EC2 instances were created to run this job, as shown in the following screenshot.

{{ :heig:cld:lab05_s1.png?nolink |}}

This pricing test was run with 3 EMR instances of type m1.small. The job took 19 minutes to complete, so we were charged for one full hour. The price was about $0.18. It is important to note that the EC2 instances are already included in the price of EMR.

{{ :heig:cld:price_1h_month.png?nolink |}}

==== How many input key-value pairs did all the mappers process? ====

This value does not appear in the log file, probably because of the Hadoop distribution used, which is Amazon 1.0.3.
Otherwise, we could have found it in the **Map input records** field.

==== How many input key-value pairs did all the reducers process? ====

This value does not appear in the log file, probably because of the Hadoop distribution used, which is Amazon 1.0.3. Otherwise, we could have found it in the **Reduce input records** field.

===== Task 2 - Writing a MapReduce Program =====

==== Copy the code of the mapper and reducer into the report. ====

Mapper:

<code python>
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    # Sign plus three digits: the temperature truncated to whole degrees
    temp = val[87:91]
    # Quality code of the temperature reading
    q = val[92:93]
    # Skip missing readings ("+999...") and readings with a bad quality code
    if temp != "+999" and re.match(r"[01459]", q):
        print("%s\t%s" % (temp, "1"))
</code>

Reducer:

<code python>
#!/usr/bin/env python
import sys

last_key = None
count_val = 0

# Input arrives sorted by key, so equal keys are adjacent
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        # Key changed: emit the total of the previous key
        print("%s\t%s" % (last_key, count_val))
        count_val = 0
    # Sum the emitted values (always 1 with the mapper above)
    count_val += int(val)
    last_key = key

# Emit the total of the final key
if last_key:
    print("%s\t%s" % (last_key, count_val))
</code>

Helper script that prints the reducer output as a text bar chart:

<code python>
#!/usr/bin/env python
import sys
import math

# Maximum bar width
width = 120
temps = {}
maximum = 0

for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    key = int(key)
    # "+000" and "-000" both parse to 0, so add the counts instead of overwriting
    temps[key] = temps.get(key, 0) + int(val)
    maximum = max(maximum, temps[key])

for key, count in sorted(temps.items()):
    # Scale each bar to the largest count; always draw at least one "X"
    bar_len = int(max(1, math.floor((count / float(maximum)) * width)))
    print("{:+03d} | [{:6d}] {:s}".format(key, count, "X" * bar_len))
</code>

==== How often does the temperature 22 degrees Celsius occur? ====

56,530 times.

==== What are the lowest and highest temperatures occurring? ====

The maximum is 38 degrees and the minimum is -25 degrees.

==== Which temperature occurs most often? ====

14 degrees, with 114,613 occurrences.

==== Download the output data and produce a histogram using a spreadsheet. ====

{{:heig:cld:lab05_awesome_bar_chart.jpg?nolink|}}
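
The mapper and reducer from Task 2 can be tried locally before submitting a job to EMR. The following is a minimal sketch, not the code that ran on the cluster: it ports the same logic to Python 3 functions and imitates the sort-by-key step that Hadoop Streaming performs between the map and reduce phases. The function names (''map_line'', ''reduce_sorted'', ''run_local'', ''fake_record'') are hypothetical, and the synthetic records assume a five-character temperature field (sign plus four digits, in tenths of a degree) starting at offset 87, followed by a one-character quality code.

```python
import re
from itertools import groupby


def map_line(line):
    """Return a (temperature, 1) pair for a valid record, else None."""
    val = line.strip()
    temp = val[87:91]  # sign + three digits: whole degrees
    q = val[92:93]     # quality code
    if temp != "+999" and re.match(r"[01459]", q):
        return (temp, 1)
    return None


def reduce_sorted(pairs):
    """Sum the counts per key; pairs must already be sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Map every line, sort by key (the shuffle step), then reduce."""
    mapped = [pair for pair in map(map_line, lines) if pair is not None]
    return dict(reduce_sorted(sorted(mapped)))


def fake_record(temp_field, quality):
    """Build a synthetic line holding only the fields the mapper reads.

    Hypothetical test data: 87 filler characters, then the assumed
    five-character temperature field, then the quality code.
    """
    return "0" * 87 + temp_field + quality
```

For example, ''run_local([fake_record("+0221", "1"), fake_record("+0225", "1")])'' counts both readings under the key ''+022'', because the mapper drops the tenths digit. Records built with ''+9999'' (missing value) or with a quality code outside ''[01459]'' are skipped.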