
Using Hadoop and Amazon Elastic MapReduce to Process Your Data More Efficiently


If the amount of data in your enterprise is overwhelming and/or you’re looking for ways to process said data more efficiently, then Hadoop and Amazon Elastic MapReduce may be your answer.

MapReduce frameworks allow developers without much knowledge of distributed computing to write applications that take advantage of distributed resources. Hadoop MapReduce is an implementation of such a model.

Background:

Recently, we developed a web asset delivery service for one of our clients that would allow businesses to display high-quality assets from a CDN on their websites for a monthly fee. Users’ accounts would be associated with bandwidth limits based on account levels tied to a pricing model. This meant that we needed a way to provide users with information on monthly bandwidth utilization across their websites in order to defend the pricing model. The solution was to implement web log parsing using Hadoop’s MapReduce framework in conjunction with Amazon’s cloud-based Elastic MapReduce service.

Here’s how Hadoop MapReduce works:

Hadoop MapReduce is a Java-based framework that allows you to write applications that process high volumes of data in parallel across a cluster of machines. Hadoop uses a distributed file storage system called Hadoop Distributed File System (HDFS) to store large amounts of data across multiple nodes. It supports most major platforms, and MapReduce programs can be written in Python, Ruby, PHP, Pig, etc., in addition to Java. Using Hadoop, we were able to write a simple Java program that could easily parse through raw data in log files collected by the CDN and filter relevant bandwidth utilization information.

The basic idea behind the MapReduce model is that you write two functions, map() and reduce(), to divide up your programming tasks and let the framework manage most of the crunching. Map and reduce functions take key-value pairs (using data types that implement Hadoop’s Writable interface) as input and output. When you start a MapReduce process, you pass in a data file in HDFS as the input. Hadoop divides the input into smaller pieces that the map function can consume. Likewise, the outputs of the map function are grouped by key by Hadoop and sent to the reduce function for processing. Both map and reduce functions can run in parallel: Hadoop can distribute the tasks across the nodes of a cluster.
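To make the map, group, reduce flow concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class MiniMapReduce, its methods, and the "key amount" record format are hypothetical and exist only for illustration; the run() method stands in for the work that Hadoop does for you on a real cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: an in-memory imitation of the map -> group -> reduce flow.
final class MiniMapReduce {

  // Hypothetical map step: turns one "key amount" record into a (key, amount) pair.
  static Map.Entry<String, Long> map(String record) {
    String[] parts = record.trim().split("\\s+");
    return new AbstractMap.SimpleImmutableEntry<>(parts[0], Long.parseLong(parts[1]));
  }

  // Hypothetical reduce step: folds every value collected for one key into a total.
  static long reduce(String key, List<Long> values) {
    long total = 0;
    for (long v : values) {
      total += v;
    }
    return total;
  }

  // Stand-in for the framework: map each record, group the pairs by key,
  // then call reduce once per key.
  static Map<String, Long> run(List<String> records) {
    Map<String, List<Long>> grouped = new TreeMap<>();
    for (String record : records) {
      Map.Entry<String, Long> pair = map(record);
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    Map<String, Long> totals = new TreeMap<>();
    for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
      totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
    }
    return totals;
  }

  public static void main(String[] args) {
    // prints {a=17, b=5}
    System.out.println(run(Arrays.asList("a 10", "b 5", "a 7")));
  }
}
```

In a real job the grouping happens across machines, which is why the map and reduce functions only ever see their own slice of the data.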

In our case, we simply pass the log file(s) (copied to HDFS) as input to the MapReduce program. Hadoop merges all log files specified and serializes each log entry to the datatype expected (Text) before passing them as inputs to the map tasks. Our map function then parses each log entry individually and stores the relevant data (bandwidth info) in a HashMap-type object (MapWritable), which is then sent as another key-value pair (<asset path – MapWritable object>) for the reduce function to work with. The reduce function then aggregates the data based on user accounts, date, user agent, etc. and saves it to a database (Amazon RDS Database). We can then query the database to pull all types of information around utilization and send out notifications to users if, for example, their account is over the monthly cap.
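As a rough sketch of the parsing and aggregation described above, the plain-Java example below sums bytes per asset path and checks the result against a cap. The log layout, the class BandwidthSketch, its method names, and the cap value are all assumptions made for illustration; the real CDN log format and the production job are not shown in this post.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: parse log lines and total the bandwidth per asset.
final class BandwidthSketch {

  // Assumed log layout (hypothetical, not the CDN's real format):
  //   <date> <asset path> <bytes sent> <user agent>
  static String[] parseLine(String line) {
    // A limit of 4 keeps any spaces in the user agent inside the last field.
    String[] fields = line.trim().split("\\s+", 4);
    if (fields.length < 4) {
      throw new IllegalArgumentException("Malformed log line: " + line);
    }
    return fields;
  }

  // Sums bytes per asset path, similar in spirit to the aggregation in the reduce step.
  static Map<String, Long> bytesPerAsset(List<String> lines) {
    Map<String, Long> totals = new LinkedHashMap<>();
    for (String line : lines) {
      String[] fields = parseLine(line);
      totals.merge(fields[1], Long.parseLong(fields[2]), Long::sum);
    }
    return totals;
  }

  // True when total usage exceeds a (hypothetical) monthly cap given in bytes.
  static boolean overCap(Map<String, Long> totals, long capBytes) {
    long used = 0;
    for (long bytes : totals.values()) {
      used += bytes;
    }
    return used > capBytes;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "2011-03-01 /img/logo.png 2048 Mozilla/5.0 (X11; Linux)",
        "2011-03-01 /img/hero.jpg 4096 curl/7.21",
        "2011-03-02 /img/logo.png 1024 Mozilla/5.0 (X11; Linux)");
    Map<String, Long> totals = bytesPerAsset(lines);
    System.out.println(totals);
    System.out.println("Over a 5000-byte cap? " + overCap(totals, 5000L));
  }
}
```

In the Hadoop version, the parsing would live in the map function and the summing in the reduce function, with the totals written to the database instead of printed.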

Below is the structure of a sample map reduce program written in Java:

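import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogProcessor {

  // Mapper: called once per log line (input key = byte offset, value = line text)
  public static class LogMap
            extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    public void map( LongWritable key, Text value, Context context )
            throws IOException, InterruptedException {
      MapWritable logEntry = new MapWritable();
      //parse log file
      ...
      // named outKey so it does not clash with the "key" parameter
      Text outKey = new Text();
      //outKey = resource-path;

      context.write( outKey, logEntry );
    }
  }

  // Reducer: called once per resource path with every entry emitted for it
  public static class LogReduce
            extends Reducer<Text, MapWritable, DBWritable, NullWritable> {
    @Override
    public void reduce( Text key, Iterable<MapWritable> values, Context context )
            throws IOException, InterruptedException {
      for (MapWritable entry : values) {
        //process entry and write to db
        ...
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Set up a new mapreduce job
    Job job = new Job();
    job.setJarByClass(LogProcessor.class);    //register the main class

    FileInputFormat.addInputPath( job, new Path(<Input file path>) );
    FileOutputFormat.setOutputPath( job, new Path(<output file path>) );

    job.setMapperClass( LogMap.class );
    job.setReducerClass( LogReduce.class );

    // Text/MapWritable are the mapper's output types, so declare them as such
    job.setMapOutputKeyClass( Text.class );
    job.setMapOutputValueClass( MapWritable.class );

    // Exit with 0 on success and 1 on failure
    System.exit( job.waitForCompletion(true) ? 0 : 1 );
  }
}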

The program is packaged in a jar file (with dependencies) that Hadoop can run.

And here’s how to utilize Amazon’s Elastic MapReduce service to run the program:

At Control Group, we leverage Amazon’s cloud-based infrastructure heavily in many of our projects. It allows us to cost-effectively (pay by usage) deploy applications that need to scale up easily. Amazon’s Elastic MapReduce service is a good fit for running the MapReduce application described above. It’s easy to set up, and it also shields us from some of the infrastructure and maintenance issues around running Hadoop.

In a nutshell, the Elastic MapReduce service runs a hosted Hadoop instance on an EC2 instance (master), and it’s able to instantly provision other pre-configured EC2 instances (slave nodes) to distribute the MapReduce process. All of these instances are terminated once the MapReduce tasks finish running. Amazon allows us to specify up to 20 EC2 instances for data-intensive processing. It also provides the option to upgrade your Elastic MapReduce account to increase the EC2 instance count.

So to run the MapReduce service, we create a new “Job Flow” via the AWS console, the command line utility (Ruby-based) or an API provided by Amazon. A job flow is a set of steps that Elastic MapReduce runs. You provide some configuration information (number of EC2 instances to use and bootstrap actions) and the location of your MapReduce program (usually an Amazon S3 bucket path). Job flow records and logs can be viewed in the AWS console. You can also explicitly instruct Elastic MapReduce to keep the master EC2 instance alive for debugging purposes; you can then ssh into the instance to check the log files created by Hadoop, etc.

In summary, Hadoop’s MapReduce framework allows us to write simple applications that process high volumes of data in a distributed computing environment while Amazon’s MapReduce service provides a cost-effective means of implementing such a solution.


Are We Too Dependent on Technology?


Our coffeemaker is broken.

Coffee is a big deal at CG. Most of our geeks pride themselves on being caffeinated, and with the coffee machine down, panic is on the rise.

The thing is, it’s not the actual coffeemaker that is broken; it’s the grinder built into the coffee machine that is having issues. We have a fancy machine that grinds coffee right before brewing it. When it’s working it’s pretty magical: nothing tastes quite like freshly brewed coffee made from freshly ground beans. It happens automatically, and other than the noise from the grinder we don’t even know that it’s there.

When the machine is down it’s obvious. Some of us need coffee to work. We are dependent on that machine.

Thinking about the caffeine situation in the office made me wonder about other pieces of technology that we’re dependent on. Our email software runs in the cloud on Google’s computers. Our data traverses networks and is converted to light, microwaves, electricity and back again before it arrives at its destination. Do you know how an email sent from your phone is routed to its destination? What other things does technology do for us automatically that we don’t notice? Heck, I can’t even remember my wife’s phone number; my phone does it for me.

Someone sent me an article the other day entitled, “The Cloud Fails Again.” In the article, John Dvorak complains that a power outage left him unable to function because all of his data and services existed in the cloud and not in his own machine. He goes on to describe a “priesthood” of systems administrators that has existed since the early days of computing whose sole purpose is to “beat back the individualism” that desktop computers brought to all of us.

I was unaware that this cabal existed (if you are a member, please send me an invite), and I feel like the advances that technology has brought us in life, business and communication are really amazing. We live in a magical world. But even though the advances are great, they have made us completely dependent on technology. I think Dvorak’s article is a pretty good example of people who rely on technology and refuse to invest in their own infrastructure. In other words, we need to understand what we’re using so that we can evaluate the risks and benefits of using it.

Control Group’s mission is to help people and their organizations better understand and utilize their technology so they can be more efficient. That’s why Control Group is a great place to work — even when the coffee machine is down.

Our engineers were able to create a temporary workaround for the coffee situation. We’re also not exactly stranded in a coffee-free wasteland: Kaffe 1668 and The Blue Spoon are within walking distance. So, no worries, we’ll stay jittery.

Image via coffeeaddict/Flickr

