Working with Hadoop: A Practical Guide – Part 6
Ray Kahn
OCT 11, 2013 11:01 AM

So far I have been able to configure a cluster of Hadoop servers using the Cloudera distribution. Today I will write about starting small, a necessary learning step when it comes to Hadoop.

To fully grasp the complexities involved in working with Hadoop, it is advisable to begin with a single server. This approach will allow me to learn more about Hadoop's configuration and inner workings.

Pre-Requisites

There are a few things that I will need to do before starting my standalone Hadoop server.

Change hosts & hostname files

Because I will be using my local host as a standalone Hadoop system, I need to change a few entries in these two files (this is an Ubuntu-centric issue, as noted here by the Hadoop Wiki).

In my “/etc/hostname” I am changing this entry

www.cloudera-node1.com

To this:

cloudera-node1

And in my “/etc/hosts” I am commenting out a few lines

#127.0.1.1      www.cloudera-node1.com cloudera-node1
#10.32.128.53   www.cloudera-node1.com
#10.32.128.59   www.cloudera-master.com

And changing this line

127.0.0.1       localhost

To this line

127.0.0.1       localhost cloudera-node1
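To apply the new hostname without a reboot and to confirm that it now resolves to the loopback address, a quick check along these lines should work (standard Ubuntu commands, nothing Hadoop-specific):

$ sudo hostname cloudera-node1

$ hostname

$ getent hosts cloudera-node1

The last command should print a 127.0.0.1 line taken straight from the edited “/etc/hosts”.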

Java

I will need to make sure that my host has Java 6 or higher installed. Most Linux distributions either ship with a Java runtime or make one trivial to install. In my “.bashrc” file I have the following entry:

export JAVA_HOME=/usr/local/java/jre1.7.0_40
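While I am in “.bashrc”, it also helps to put the JRE on my PATH and to confirm which Java version Hadoop will actually pick up (an optional sanity check):

export PATH=$JAVA_HOME/bin:$PATH

$ java -version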

SSH

Hadoop requires SSH to manage remote servers as well as the local machine. For my “single node” experiment I will need to configure SSH access to localhost. So before proceeding further I have made sure SSH is installed and running on my local server.

The easiest way to get “prompt-less” authentication is to create a public key.  The following command creates an SSH key pair for the logged-in user:

$ ssh-keygen -t rsa -P ""

I am creating an RSA key with an empty password [-P “”]. The reason for the empty password is that I don’t want to be prompted for one when a Hadoop process needs to log in to my local host.  My standalone server will act as data node, name node, job tracker and task tracker.

Now I will need to add this newly created key to my authorized SSH key chain:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
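If the login test below still prompts for a password, the usual culprit is file permissions; sshd silently ignores keys whose directory or file is too open. Tightening them is harmless either way:

$ chmod 700 $HOME/.ssh

$ chmod 600 $HOME/.ssh/authorized_keys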

Now I just test my key by “sshing” to localhost:

$ ssh localhost

Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.5.0-40-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

91 packages can be updated.
55 updates are security updates.

Last login: Fri Oct 11 07:49:29 2013 from localhost

Good, it works just as I thought.

Installing Hadoop

Although my host is already part of a Cloudera Hadoop cluster, the point of this exercise is to make sure I can indeed run a MapReduce job on a single node before jumping into the maze of cluster and MapReduce job management. So I am going to install Hadoop in my host’s “/usr/local” directory.  I downloaded the latest stable binary version of Hadoop from here. Installation is a matter of unpacking a tar file:

$ cd /usr/local

$ sudo tar xzf hadoop-1.0.3.tar.gz

$ sudo mv hadoop-1.0.3 hadoop

It’s that easy. Now I will have to update my “.bashrc” file again to indicate where I placed Hadoop:

export HADOOP_HOME=/usr/local/hadoop
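While editing “.bashrc” I am also adding Hadoop’s bin directory to my PATH, which saves typing full paths later; “hadoop version” then confirms the shell can see the install (optional, but convenient):

export PATH=$HADOOP_HOME/bin:$PATH

$ hadoop version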

Configuring Hadoop

Hadoop configuration requires editing a few files.  Some are basic and others require a greater understanding of the Hadoop ecosystem. Of course, all of this information is available online. I will be editing the following files:

/usr/local/hadoop/conf/hadoop-env.sh

I added the following to the file:

export JAVA_HOME=/usr/local/java/jre1.7.0_40

/usr/local/hadoop/conf/core-site.xml

I added the following entries to the file, in between the <configuration>…</configuration> tags:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop/tmp</value>
</property>

Of course I did have to create the directory as well (though any directory Hadoop can write to will work):

$ sudo mkdir -p /tmp/hadoop/tmp

$ sudo chmod 750 /tmp/hadoop/tmp
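Two caveats worth noting here. First, because the directory was created with sudo it is owned by root, while the Hadoop daemons will run as my own user, so I am handing it over (substitute your own user name as appropriate):

$ sudo chown -R $USER /tmp/hadoop/tmp

Second, most single-node guides also name the default filesystem in “core-site.xml”; without it Hadoop falls back to the local filesystem rather than HDFS. The conventional entry, with 54310 being the customary (but arbitrary) port, is:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>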

/usr/local/hadoop/conf/mapred-site.xml

I added the following entries to the file, in between the <configuration>…</configuration> tags:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

This is the host and port at which the MapReduce job tracker listens.

/usr/local/hadoop/conf/hdfs-site.xml

I added the following entries to the file, in between the <configuration>…</configuration> tags:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

This sets the default number of copies HDFS keeps of each block. Replication across machines is what gives a cluster its data availability, but on a single node only one replica is possible, so it is set to 1.
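Most single-node guides include one more step before the first start: formatting the HDFS filesystem via the name node. This initializes the storage directories configured above and only needs to be done once; rerunning it later would erase everything in HDFS:

$ /usr/local/hadoop/bin/hadoop namenode -format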

Starting & Stopping Hadoop Node

Hadoop already comes with start and stop scripts.  This is my ultimate test: it will let me know whether I have configured my Hadoop environment correctly.

$ /usr/local/hadoop/bin/start-all.sh

Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-namenode-cloudera-node1.out
localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-datanode-cloudera-node1.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-cloudera-node1.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-jobtracker-cloudera-node1.out
localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hduser-tasktracker-cloudera-node1.out

Success… What a relief.
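A quick way to double-check that all five daemons are really up is the JDK’s jps tool, which lists running Java processes; the output should include NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. (Note that jps ships with the JDK, not the JRE, so on a JRE-only install like mine it may be missing.)

$ jps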

Now I am going to stop my Hadoop node:

$ /usr/local/hadoop/bin/stop-all.sh

Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Success again…

What’s next?

Next time I will run my first “simple” MapReduce job.

If you or your company are interested in more information on this and other topics, be sure to keep reading my blog.

Also, as I am with the IEEE Computer Society, I should mention there are technology resources available 24/7 and specific training on custom topics. Here is the IEEE CS program link if you are interested: TechLeader Training Partner Program, http://www.computer.org/portal/web/Corporate-Programs.
