Working with Hadoop: A Practical Guide – Part 4
Ray Kahn
AUG 19, 2013 12:40 PM

As you may recall, in my earlier blog I stated that my client servers might need their memory upgraded. It didn't take long: my standalone server became unresponsive when I tried to run too many processes on it, so I added an extra 1 GB of RAM to that client server (I will refer to it as the less-than-optimal client henceforth). I also had to replace another one of my servers, which was too weak to be of any use for the type of jobs I'll be running; in its place I added a quad-core PC with a 250 GB hard drive and 4 GB of RAM to my cluster. Again, I would like to reemphasize that I am using hardware I already have and not purchasing anything new specifically for my analytics project.

Virtualizing My Servers

In my cluster I have a master node, the less-than-optimal PC, a quad-core PC, and a laptop. The laptop is also a quad-core machine, with 8 GB of RAM and plenty of disk space. The logical thing to do is to virtualize the computers that have resources to spare: by doing so I add more client nodes to my cluster at no extra cost, a win-win situation.

I took the advice of one of my teammates and installed VirtualBox, an open-source virtualization product from Oracle. Working with it is very easy and its installation is seamless; of course, you may prefer a different virtualization platform. One thing to watch when building your virtual servers is memory allocation: assign enough, but not too much, as I did with one of the servers I built. VirtualBox warns you if you have assigned too much memory to a single server, but once a VM exists the way to change its configuration is through the command line. Again, it is very easy to do. For more information about VirtualBox's command-line options, see its user manual or just type "VBoxManage --help" in a terminal.
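As a rough sketch of what that command-line workflow looks like, here is how you might resize a VM's memory with `VBoxManage` (the VM name and memory size below are hypothetical; substitute your own values):

```shell
# Hypothetical VM name and memory size -- adjust for your own hardware.
VM_NAME="hadoop-client-1"
VM_RAM_MB=2048   # leave some headroom for the host OS

# VBoxManage is on the PATH once VirtualBox is installed.
if command -v VBoxManage >/dev/null 2>&1; then
    # Grow or shrink the VM's memory from the command line;
    # the VM must be powered off for modifyvm to take effect.
    VBoxManage modifyvm "$VM_NAME" --memory "$VM_RAM_MB"
else
    echo "VirtualBox not found; would run: VBoxManage modifyvm $VM_NAME --memory $VM_RAM_MB"
fi
```

The same `modifyvm` subcommand also handles CPU count (`--cpus`) and other settings, which is why the command line ends up being the quickest way to fix an over-allocated VM.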

Hardware Recommendations

As you may recall, I have installed Cloudera's Hadoop distribution, so it is only fitting to pass along the "official" recommendations for an optimal Hadoop server configuration. Hadoop works with commodity hardware (basically anything off the shelf will do), but a server's physical architecture (RAM, CPU, disks, and network card) does impact the performance of the cluster you are building. The general recommendation is mid-level rack servers with dual sockets, as much error-correcting (ECC) RAM as is affordable, and SATA drives. However, don't use RAID on the DataNodes: HDFS already has error checking and replication built in, so RAID's redundancy is wasted there.

From a network perspective, all master and slave nodes must be able to talk to one another. If your cluster is not very big, a 1 Gb (gigabit Ethernet) network card per node should be good enough.
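A quick way to sanity-check that the nodes can all reach one another is a ping sweep over the host names you have defined (the node names below are hypothetical placeholders; replace them with the entries from your cluster's `/etc/hosts`):

```shell
# Hypothetical node names -- replace with your own /etc/hosts entries.
NODES="master client1 client2 laptop"

for host in $NODES; do
    # One ping with a 1-second timeout (Linux ping flags); a failure here
    # usually means broken name resolution or a firewall between nodes.
    if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
        echo "$host: reachable"
    else
        echo "$host: UNREACHABLE"
    fi
done
```

Running this from each node in turn catches one-way problems, such as a host that can reach the master but not the other slaves.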

Physical Architecture

A more detailed explanation of Hadoop’s physical architecture is necessary here. Hadoop is divided into three distinct parts:

1. Client hosts: These servers run application code such as Pig ("a high-level data-flow language and execution framework for parallel computation"), Hive ("a data warehouse infrastructure which allows SQL-like ad hoc querying of data"), and Mahout ("scalable machine learning algorithms"). These applications don't need to be installed on your cluster.

2. Master nodes: There are two of these. The HDFS, MapReduce ("the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster"), and HBase ("a Bigtable-like structured storage system") daemons run on the primary master node. Also running on the primary are the NameNode ("a directory tree of all files in the file system"), ZooKeeper ("a high-performance coordination service for distributed applications"), which is used by HBase for metadata storage, and the JobTracker ("a service that farms out MapReduce tasks to specific nodes in the cluster"). The secondary master provides a Secondary NameNode, used in this context as a checkpoint management service, and a secondary ZooKeeper.

3. Slave nodes: These servers run the DataNode ("stores data in the HDFS"), the TaskTracker ("a daemon that manages accepted tasks"), the RegionServer (which handles requests to the slave node by writing them first to memory and a commit log, and eventually to permanent storage in HDFS), RHIPE ("R and Hadoop Integrated Programming Environment"), R ("an interactive language and environment for data analysis"), and RHadoop ("packages that allow users to manage and analyze data with Hadoop").
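Once the daemons above are supposed to be running, `jps` (which ships with the JDK) is the quickest way to see which of them actually are on a given node. A minimal sketch; the commented `dfsadmin` line assumes the CDH-era `hadoop` command is on the PATH:

```shell
# List the Java daemons running on this node. On the primary master you
# would expect to see NameNode, JobTracker, and ZooKeeper's QuorumPeerMain;
# on a slave node, DataNode and TaskTracker.
if command -v jps >/dev/null 2>&1; then
    jps
else
    echo "jps not found -- is a JDK installed on this node?"
fi

# For a cluster-wide view of the DataNodes and their capacity, run this
# on the master:
#   hadoop dfsadmin -report
```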


What’s next?

Next week I will write about starting, configuring, and running a simple MapReduce job, along with a few important configuration details.

If you or your company are interested in more information on this and other topics, be sure to keep reading my blog.

Also, as I am with the IEEE Computer Society, I should mention that technology resources are available 24/7, along with specific training on custom topics. If you are interested, look up the IEEE CS TechLeader Training Partner Program.

