Working with Hadoop: A Practical Guide – Part 2
Ray Kahn
AUG 02, 2013 10:20 AM

In part 1 I set out to provide a high-level overview of Hadoop, albeit a very brief and simplified one. Today I will discuss Hadoop's core components, explain my choice of installation distribution, and point to online resources with a more complete list of Hadoop distributions.

Core Components

Hadoop is a platform that provides distributed storage and computation. It is based on a distributed master-slave architecture with two important components: 1) a file system for storage, the Hadoop Distributed File System (HDFS), and 2) MapReduce for computing jobs/tasks. Hadoop partitions very large data sets and runs tasks/jobs on them in parallel. In fact, Hadoop is NOT very efficient when working with small data sets.

I will briefly explain these two components below.

What is HDFS?

HDFS is the storage component of Hadoop. It is very scalable and has a high degree of availability. It is distributed and optimized for high throughput. Again, HDFS works best when reading and writing large files. It uses very large filesystem block sizes and relies on data locality to reduce network traffic.

HDFS is fault tolerant in the face of both software and hardware failure, and it replicates data blocks according to your configuration. If a node fails, Hadoop re-replicates the blocks that were stored on it to nodes that are still available or idle.
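To make the block-size and replication ideas concrete, here is a small back-of-the-envelope sketch. The 64 MB block size and replication factor of 3 are illustrative assumptions (both are configurable; 64 MB and three copies were common defaults in the Hadoop 1.x era):

```python
import math

# Illustrative assumptions, not fixed values: HDFS block size and
# replication factor are both configurable per cluster.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB per block
REPLICATION = 3                 # three copies of every block

def hdfs_footprint(file_size_bytes):
    """Return (block_count, raw_bytes_stored) for a file of the given size."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

# A 1 GB file splits into 16 blocks of 64 MB, and with three replicas
# it occupies 3 GB of raw cluster storage.
blocks, stored = hdfs_footprint(1024 * 1024 * 1024)
print(blocks, stored)  # 16 3221225472
```

This also shows why small files are a poor fit: a 1 KB file still occupies a whole block entry in the NameNode's metadata, so millions of tiny files create overhead without filling blocks.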

What is MapReduce?

MapReduce is a batch-based computing framework. It is distributed and allows jobs and tasks to run in parallel. Because it is a framework, you don't need to worry about parallelizing jobs or the complexities of a distributed system; work distribution is handled by the MapReduce framework, and programmers can instead concentrate on writing applications that address the business needs.

MapReduce breaks a job into two components: 1) the Map component takes a record from a table or a line from a file and produces key/value pairs based on the program's specific requirement (a filter, for example). 2) the Reduce component takes the key/value pairs produced by the Map phase and combines those with the same key into a single record. For example, all ArrayIndexOutOfBoundsExceptions emitted by the Map functionality are added up in the Reduce functionality to give the number of times that specific exception was encountered.
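The two phases can be sketched in plain Python. This is a single-process illustration of the MapReduce model, not the Hadoop API itself; the log lines and exception names are made up for the example:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit an (exception_name, 1) pair for every exception on the line.
    for token in ("NullPointerException", "ArrayIndexOutOfBoundsException"):
        if token in line:
            yield (token, 1)

def reduce_phase(pairs):
    # Reduce: combine all pairs that share a key into a single count.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

log_lines = [
    "ERROR java.lang.NullPointerException at Foo.bar",
    "ERROR java.lang.ArrayIndexOutOfBoundsException at Baz.qux",
    "ERROR java.lang.NullPointerException at Foo.baz",
]

pairs = [pair for line in log_lines for pair in map_phase(line)]
print(reduce_phase(pairs))
# {'NullPointerException': 2, 'ArrayIndexOutOfBoundsException': 1}
```

In real Hadoop, the framework runs many Map tasks in parallel over HDFS blocks, sorts and groups the emitted pairs by key, and feeds each group to a Reduce task; the programmer writes only the two functions.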

Which Distribution

I intend to create and run a MapReduce job to analyze our server logs. The program will run nightly and look for specific exceptions: runtime, array index out of bounds, null pointer, etc. Since we have weekly code deployments, I need to see the effect of new code on the frequency of these exceptions.
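The comparison step after the nightly counts are in could look something like this. It is a hypothetical sketch with made-up numbers, just to show the shape of the before/after report:

```python
def exception_delta(before, after):
    """Return the change in count for each exception type across a deployment."""
    keys = set(before) | set(after)
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

# Hypothetical counts from the weeks before and after a deployment.
before = {"NullPointerException": 40, "RuntimeException": 5}
after = {"NullPointerException": 25, "RuntimeException": 12,
         "ArrayIndexOutOfBoundsException": 3}

# Positive values mean the new code made an exception more frequent.
print(exception_delta(before, after))
```

A new exception type showing up only in `after` (here, the array index one) is often the most useful signal of a regression introduced by the deployment.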

I have selected Cloudera’s distribution as my choice of Hadoop. Here is the link:

You can select from a long list of distros from the link I have provided below. The choice is yours and I make no recommendation one way or the other.

Next week I will install Cloudera's Hadoop on my Linux box and provide step-by-step instructions.

Online Resources

There are plenty of online resources available on Hadoop; just Google it and you will get more than 9 million results. One of my favorites is the official Apache Hadoop wiki, which can be found here: it has loads of information, and I highly recommend this site.
