Using Hadoop and Amazon Elastic MapReduce to Process Your Data More Efficiently

by Ubin Malla

If the volume of data in your enterprise is overwhelming, or you’re looking for ways to process it more efficiently, Hadoop and Amazon Elastic MapReduce may be your answer.

MapReduce frameworks allow developers without much knowledge of distributed computing to write applications that take advantage of distributed resources. Hadoop MapReduce is one implementation of that model.

Background:

Recently, we developed a web asset delivery service for one of our clients that allows businesses to display high-quality assets from a CDN on their websites for a monthly fee. Each user account carries a bandwidth limit set by its account level in the pricing model. To support that pricing model, we needed a way to report monthly bandwidth utilization across each user’s websites. Our solution was to parse the CDN’s web logs using Hadoop’s MapReduce framework in conjunction with Amazon’s cloud-based Elastic MapReduce service.

Here’s how Hadoop MapReduce works:

Hadoop MapReduce is a Java-based framework for writing applications that process high volumes of data in parallel across a cluster of machines. Hadoop uses a distributed file storage system called Hadoop Distributed File System (HDFS) to store large amounts of data across multiple nodes. It supports most major platforms, and in addition to Java, MapReduce programs can be written in Python, Ruby, PHP, and other languages, or expressed in higher-level tools such as Pig. Using Hadoop, we were able to write a simple Java program that parses the raw log files collected by the CDN and extracts the relevant bandwidth utilization information.

The basic idea behind the map-reduce model is that you write two functions, map() and reduce(), to divide up your processing and let the framework manage most of the crunching. Both functions take key-value pairs as input and produce key-value pairs as output, using data types that implement Hadoop’s Writable interface. When you start a MapReduce job, you pass in a data file in HDFS as the input. Hadoop splits the input into smaller pieces that the map function can consume. Likewise, Hadoop groups the outputs of the map function by key and sends each group to the reduce function for processing. Both map and reduce tasks can run in parallel, since Hadoop distributes them across the nodes of the cluster.
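To make the flow concrete, here is a minimal, framework-free sketch of the map, group, and reduce steps in plain Java. It does not use Hadoop at all: the class name MiniMapReduce, the "key value" line layout, and the sample data are assumptions made for illustration, and the grouping loop only stands in for what Hadoop does between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of map -> group by key -> reduce.
// Nothing here is Hadoop API; it only mirrors the shape of the model.
class MiniMapReduce {

    // "map": turn one input record into zero or more (key, value) pairs.
    // Assumed record layout: "<key> <number>"; other lines produce no pairs.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        String[] fields = line.trim().split("\\s+");
        if (fields.length == 2) {
            pairs.add(Map.entry(fields[0], Long.valueOf(fields[1])));
        }
        return pairs;
    }

    // "reduce": collapse all values that share a key into a single result.
    static long reduce(String key, List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Long> run(List<String> input) {
        // Grouping step: the framework does this between map and reduce.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Long> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Call reduce once per key with all of that key's values.
        Map<String, Long> output = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = List.of("/img/a.png 100", "/img/b.png 250", "/img/a.png 50");
        System.out.println(run(input)); // prints {/img/a.png=150, /img/b.png=250}
    }
}
```

In a real job, the calls to map() and reduce() would run on different nodes, and the grouping would happen as data moves between them; the single-process loop above only shows the order of operations.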

In our case, we simply pass the log file(s) (copied to HDFS) as input to the MapReduce program. Hadoop reads all of the specified log files and hands each log entry to the map tasks as the expected data type (Text). Our map function parses each log entry individually and stores the relevant data (bandwidth info) in a HashMap-like object (MapWritable), which is then emitted as a key-value pair (<asset path, MapWritable object>) for the reduce function to work with. The reduce function aggregates the data by user account, date, user agent, etc. and saves it to a database (Amazon RDS). We can then query the database for all kinds of utilization information and send notifications to users, for example when an account goes over its monthly cap.
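As a rough illustration of that pipeline, the sketch below parses log lines, totals bytes per account, and lists the accounts that exceed a cap. The real CDN log format is not shown in this post, so the four-field line layout, the class and method names, and the in-memory maps (standing in for the database) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the log layout and every name below are assumptions.
class BandwidthUsage {

    // One parsed log entry; a plain-Java stand-in for the MapWritable value.
    static final class LogEntry {
        final String account;
        final String assetPath;
        final long bytes;

        LogEntry(String account, String assetPath, long bytes) {
            this.account = account;
            this.assetPath = assetPath;
            this.bytes = bytes;
        }
    }

    // Assumed line layout: "<date> <account> <asset path> <bytes sent>".
    // Returns null for lines that do not match, so callers can skip them.
    static LogEntry parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 4) {
            return null;
        }
        try {
            return new LogEntry(f[1], f[2], Long.parseLong(f[3]));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Sum bytes per account, roughly what the reduce step aggregates.
    static Map<String, Long> totalsByAccount(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            LogEntry entry = parse(line);
            if (entry != null) {
                totals.merge(entry.account, entry.bytes, Long::sum);
            }
        }
        return totals;
    }

    // Accounts whose total exceeds their monthly cap (caps in bytes, keyed by account).
    // Accounts without a configured cap are never reported.
    static List<String> overCap(Map<String, Long> totals, Map<String, Long> caps) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> total : totals.entrySet()) {
            Long cap = caps.get(total.getKey());
            if (cap != null && total.getValue() > cap) {
                result.add(total.getKey());
            }
        }
        return result;
    }
}
```

Returning null from parse() for malformed lines is one simple way to skip bad records without failing the whole run; a production job would more likely count them so the skipped volume is visible.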

Below is the structure of a sample map reduce program written in Java:

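import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogProcessor {

  public static class LogMap
            extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    public void map( LongWritable key, Text value, Context context )
            throws IOException, InterruptedException {
      MapWritable logEntry = new MapWritable();
      //parse one log line (value) into logEntry
      // ...
      Text resourcePath = new Text();
      //resourcePath = resource-path;

      context.write( resourcePath, logEntry );
    }
  }

  public static class LogReduce
            extends Reducer<Text, MapWritable, DBWritable, NullWritable> {
    @Override
    public void reduce( Text key, Iterable<MapWritable> values, Context context )
            throws IOException, InterruptedException {
      //values holds every logEntry emitted for this resource path
      for ( MapWritable entry : values ) {
        //process entry and write to db
        // ...
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Set up a new mapreduce job
    // (new Job() is deprecated on newer Hadoop releases; Job.getInstance() replaces it)
    Job job = new Job();
    job.setJarByClass(LogProcessor.class);    //register the main class

    FileInputFormat.addInputPath( job, new Path(args[0]) );      //input file path
    FileOutputFormat.setOutputPath( job, new Path(args[1]) );    //output file path

    job.setMapperClass( LogMap.class );
    job.setReducerClass( LogReduce.class );

    //types of the key-value pairs emitted by the mapper
    job.setMapOutputKeyClass( Text.class );
    job.setMapOutputValueClass( MapWritable.class );

    //exit with 0 on success, 1 on failure
    System.exit( job.waitForCompletion(true) ? 0 : 1 );
  }
}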

The program is packaged in a jar file (with dependencies) that Hadoop can run.

And here’s how to utilize Amazon’s Elastic MapReduce service to run the program:

At Control Group, we rely heavily on Amazon’s cloud-based infrastructure in many of our projects. Because we pay by usage, it lets us cost-effectively deploy applications that need to scale up easily. Amazon’s Elastic MapReduce service is a good fit for running the MapReduce application described above. It’s easy to set up, and it shields us from some of the infrastructure and maintenance issues around running Hadoop.

In a nutshell, the Elastic MapReduce service runs a hosted Hadoop installation on an EC2 instance (the master) and can provision other pre-configured EC2 instances (slave nodes) on demand to distribute the MapReduce work. All of the instances are terminated once the MapReduce tasks finish running. Amazon allows us to specify up to 20 EC2 instances for data-intensive processing, and provides the option to raise the EC2 instance count available to Elastic MapReduce.

To run the MapReduce program, we create a new “Job Flow” via the AWS console, the (Ruby-based) command line utility, or an API provided by Amazon. A job flow is a set of steps that Elastic MapReduce runs. You provide some configuration information (the number of EC2 instances to use and any bootstrap actions) and the location of your MapReduce program (usually an Amazon S3 bucket path). Job flow records and logs can be viewed in the AWS console. You can also explicitly instruct Elastic MapReduce to keep the master EC2 instance alive for debugging purposes; you can then ssh into the instance to check the log files created by Hadoop, etc.

In summary, Hadoop’s MapReduce framework allows us to write simple applications that process high volumes of data in a distributed computing environment, while Amazon’s Elastic MapReduce service provides a cost-effective means of running such a solution.

14 Responses to “Using Hadoop and Amazon Elastic MapReduce to Process Your Data More Efficiently”

  • tsanko says:

    Wonderful ..thanks a lot for posting a good informitive blog

  • […] [6] Ubin Malla, Using Hadoop and Amazon Elastic MapReduce to Process Your Data More Efficiently, http://blog.controlgroup.com/2010/10/13/hadoop-and-amazon-elastic-mapreduce-analyzing-log-files/ […]

  • thank you for this stuff

  • Ashwin says:

    Thank you! How to skip bad records on EMR Map/Reduce jobs?

  • Sanket says:

    Hi, how did you get your mapreduce jobs to store data in RDS? Just wondering what interface you used.
