====== Lab 05: MapReduce in the Cloud ======

//By: Laureline David & Michael Rohrer//
==== Pedagogical Objectives ====

  * Perform data analysis in the cloud using a dynamically allocated cluster of machines
  * Write a MapReduce program
  * Become familiar with Hadoop

==== Tasks ====

In this lab you will perform a number of tasks and document your progress in a lab report. Each task specifies one or more deliverables to be produced. Collect all the deliverables in your lab report. Give the lab report a structure that mimics the structure of this document.

===== Task 1 - Using Elastic MapReduce =====

==== Copy the lines that follow and contain several dozen counters into the lab report. ====

No such lines were found in the syslog (attached as an annex) or in the stderr files. This is probably due to the Hadoop distribution used, Amazon 1.0.3.


==== Copy a screenshot of the EMR console into the report. ====

{{ :heig:cld:lab05_c1.png?nolink |}}

{{ :heig:cld:lab05_s1.png?nolink |}}

==== Copy the bar chart of maximum temperature by year into the report. ====

{{ :heig:cld:lab05graph1.png?nolink |}}

==== What is the overall highest temperature in the data set? ====

The overall highest temperature is 38.0 degrees. This temperature was reached in 2003.

==== How many EC2 instances were created to run the job? ====

Three EC2 instances were created to run this job, as the following screenshot shows.

{{ :heig:cld:lab05_s1.png?nolink |}}

This pricing test was done with 3 EMR instances of type m1.small. The job took 19 minutes to complete, so we were charged for one full hour. The price was about 0.18 USD. Note that the EC2 instances are already included in the price of EMR.

{{ :heig:cld:price_1h_month.png?nolink |}}
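The hourly rounding described above can be checked with a few lines of Python. This is only a sketch: the function name ''billed_cost'' is ours, and the rate of 0.06 USD per instance-hour is an assumption obtained by dividing the 0.18 USD total by the 3 instances, not a figure taken from the price list.

```python
import math

def billed_cost(instances, runtime_minutes, rate_per_instance_hour):
    """Cost of a job when every started hour is billed in full."""
    billed_hours = int(math.ceil(runtime_minutes / 60.0))
    return instances * billed_hours * rate_per_instance_hour

# Assumption: 0.06 USD per instance-hour, back-computed from the 0.18 USD total.
cost = billed_cost(instances=3, runtime_minutes=19, rate_per_instance_hour=0.06)
print("%.2f USD" % cost)
```

With these inputs a 19-minute job costs the same as a 60-minute one, while a 61-minute job would be billed for two hours.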

==== How many input key-value pairs did all the mappers process? ====

This is not written in the log file, probably because of the Hadoop distribution used, Amazon 1.0.3. Otherwise we could have found it in the **Map input records** field.

==== How many input key-value pairs did all the reducers process? ====

This is not written in the log file, probably because of the Hadoop distribution used, Amazon 1.0.3. Otherwise we could have found it in the **Reduce input records** field (**Reduce input groups** only gives the number of distinct keys).
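Had the counters been present, they could have been pulled out of the syslog with a short script. The sketch below assumes counter lines of the form ''name=value'' after the logger prefix, which is how Hadoop 1.x usually prints them. The sample lines and their values are purely illustrative and do not come from our job.

```python
import re

# A counter line ends with "<counter name>=<integer>" after the logger prefix.
COUNTER_RE = re.compile(r":\s+([A-Za-z][\w \-]*)=(\d+)\s*$")

def parse_counters(lines):
    """Return a dict mapping counter names to integer values."""
    counters = {}
    for line in lines:
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

# Illustrative lines only: the values are made up, not taken from our job.
sample = [
    "INFO org.apache.hadoop.mapred.JobClient:     Map input records=100",
    "INFO org.apache.hadoop.mapred.JobClient:     Reduce input groups=7",
    "INFO org.apache.hadoop.mapred.JobClient:     Reduce input records=40",
    "INFO org.apache.hadoop.mapred.JobClient:  map 100% reduce 100%",
]
counters = parse_counters(sample)
for name in sorted(counters):
    print("%s: %d" % (name, counters[name]))
```

Progress lines such as the last one contain no ''='' sign and are simply skipped.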


===== Task 2 - Writing a MapReduce Program =====

==== Copy the code of the mapper and reducer into the report. ====

<file python mapper.py>
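#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
  val = line.strip()
  # Sign plus the first three digits of the temperature field (the last digit is dropped)
  temp = val[87:91]
  # Quality code of the reading
  q = val[92:93]
  # Skip missing readings (+999...) and readings with a rejected quality code
  if temp != "+999" and re.match("[01459]", q):
    # Emit one (temperature, 1) pair per valid reading
    print("%s\t%s" % (temp, "1"))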
</file>
    
<file python reducer.py>
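#!/usr/bin/env python

import sys

# The input arrives sorted by key, so equal keys are adjacent
last_key = None
count_val = 0
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  # Key changed: emit the total for the previous key and reset the counter
  if last_key and last_key != key:
    print("%s\t%s" % (last_key, count_val))
    count_val = 0

  # Add the emitted value (always 1 with our mapper)
  count_val += int(val)
  last_key = key

# Emit the total for the final key
if last_key:
  print("%s\t%s" % (last_key, count_val))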
</file>
  
<file python grapher.py>
#!/usr/bin/env python

import sys
import math

# Maximum bar width, in characters
width = 120

# Read "temperature<TAB>count" lines and remember the largest count
temps = {}
maximum = 0
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  key = int(key)
  val = int(val)

  temps[key] = val
  maximum = max(maximum, val)

# One bar per temperature, scaled to the largest count (at least one X per row)
for key, count in sorted(temps.items()):
  bar = "X" * int(max(1, math.floor((count / float(maximum)) * width)))
  print("{:+03d} | [{:6d}] {:s}".format(key, count, bar))
</file>
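Before launching a job on EMR, the mapper and reducer logic can be tried locally on a few lines. The sketch below reproduces the map, sort and reduce steps in memory. The helper ''fake_record'' is hypothetical: it only pads a line so that the temperature and quality fields sit at the offsets read by mapper.py, and the records themselves are invented test data.

```python
import re
from itertools import groupby

def map_line(line):
    """Same filter as mapper.py: return (temperature, 1) or None if rejected."""
    val = line.strip()
    temp, q = val[87:91], val[92:93]
    if temp != "+999" and re.match("[01459]", q):
        return (temp, 1)
    return None

def run_job(lines):
    """Map every line, sort by key (stand-in for Hadoop's shuffle), then sum per key."""
    pairs = [p for p in (map_line(l) for l in lines) if p is not None]
    pairs.sort(key=lambda p: p[0])
    return [(key, sum(v for _, v in group))
            for key, group in groupby(pairs, key=lambda p: p[0])]

def fake_record(temp_field, quality):
    """Hypothetical helper: 87 filler characters, then temperature and quality."""
    return "0" * 87 + temp_field + quality

# Invented test records, not real weather data.
records = [
    fake_record("+0220", "1"),
    fake_record("+0225", "1"),  # last digit dropped, counted under +022
    fake_record("-0031", "4"),
    fake_record("+9999", "1"),  # missing reading, skipped
    fake_record("+0140", "2"),  # rejected quality code, skipped
]
result = run_job(records)
for key, count in result:
    print("%s\t%d" % (key, count))
```

The keys are compared as strings, exactly as in the streaming job, so ''+022'' sorts before ''-003''.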

==== How often does the temperature 22 degrees celsius occur? ====

56,530 times.

==== What are the lowest and highest temperatures occurring? ====

Maximum is 38 degrees, minimum is -25 degrees.

==== Which temperature occurs most often? ====

14 degrees, with 114,613 occurrences.

==== Download the output data and produce a histogram using a spreadsheet. ====
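The three answers above can be read off the reducer output with a few lines of Python. This is a minimal sketch, assuming the output has been downloaded as ''temperature<TAB>count'' lines. The function names are ours and the sample counts are toy values, not our results.

```python
def load_histogram(lines):
    """Parse 'temperature<TAB>count' lines into a dict of int -> int."""
    hist = {}
    for line in lines:
        key, val = line.strip().split("\t")
        hist[int(key)] = int(val)
    return hist

def summarize(hist):
    """Lowest and highest temperature, plus the most frequent one and its count."""
    most_common = max(hist, key=hist.get)
    return {
        "min": min(hist),
        "max": max(hist),
        "mode": most_common,
        "mode_count": hist[most_common],
    }

# Toy values for illustration only.
sample = ["-003\t1", "+014\t5", "+022\t3", "+030\t2"]
hist = load_histogram(sample)
summary = summarize(hist)
print("22 degrees occurs %d times" % hist.get(22, 0))
print("min %(min)d, max %(max)d, most frequent %(mode)d (%(mode_count)d times)" % summary)
```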

{{:heig:cld:lab05_awesome_bar_chart.jpg?nolink|}}