Hadoop Multi Node Set Up
With this post I hope to share the procedure for setting up Apache Hadoop on multiple nodes. It is a continuation of the earlier post, Hadoop Single Node Set-up.
The steps given here set up a two node cluster, which can then be expanded to more nodes according to the volume of data. The unique capabilities of Hadoop are best observed when it works on a large volume of data in a multi node cluster of commodity hardware.
Before proceeding to the set-up, it is useful to have a general idea of the HDFS (Hadoop Distributed File System) architecture, the default data storage for Hadoop, so that we understand the steps we are following and what happens at execution. In brief, it is a master-slave architecture where the master acts as the NameNode, which manages the file system namespace, and the slaves act as DataNodes, which manage the storage attached to each node. In addition, for MapReduce there is the JobTracker, a master node, and the TaskTrackers, which are slave nodes.
This Hadoop document covers the details of setting up Hadoop in a cluster, in brief. Here I am sharing a more detailed guide to the set-up process, with the following line-up.
- Pre-requisites
- Hadoop Configurations
- Running the Multi Node Cluster
- Running a Map Reduce Job
Pre-requisites
The following steps set up a Hadoop cluster with two Linux machines. Before proceeding to the cluster, it is convenient if both machines are already set up for single node operation, so that we can move to the cluster with minimum modifications and less hassle.
It is recommended to follow the same paths and installation locations on each machine when setting up the single node. This makes our lives easy during installation as well as when tracking down any problems later at execution. For example, if we follow the same paths and installation locations on each machine (i.e. hpuser/hadoop/), we can follow all the steps of the single-node set-up procedure on one machine and simply copy the folder to the other machines; no modification of the paths is needed.
Single Node Set-up in Each Machine
Networking
Obviously the two nodes need to be networked so that they can communicate with each other. We can connect them via a network cable or any other option. For what follows we just need the IP addresses of the two machines on the established connection. I have selected 192.168.0.1 as the master machine and 192.168.0.2 as a slave. Then we need to add these to the '/etc/hosts' file of each machine as follows.
192.168.0.1 master
192.168.0.2 slave
Note: When more slaves are added, they should be listed here in each machine, using unique names for the slaves (e.g. slave01, slave02).
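For instance, if two more slaves were added, the '/etc/hosts' file on every machine would carry entries like the following (the IP addresses here are only placeholders):
192.168.0.1 master
192.168.0.2 slave
192.168.0.3 slave01
192.168.0.4 slave02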
Enable SSH Access
We did this step in the single node set-up of each machine, to create a secured channel between localhost and hpuser. Now we need to make hpuser on master capable of connecting to the hpuser account on slave via a password-less SSH login. We can do this by adding the public SSH key of hpuser on master to the authorized_keys of hpuser on slave. The following command, run as hpuser on master, will do the work.
hpuser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hpuser@slave
Note: If more slaves are present, this needs to be repeated for each of them (a scripted alternative is sketched at the end of this section). The command will prompt for the password of hpuser on slave, and once it is given we are done. To test, we can try to connect from master to master and from master to slave, as per our requirement, as follows.
hpuser@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is ............................................................
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave (192.168.0.2)' (RSA) to the list of known hosts.
hpuser@slave's password:
Welcome to Ubuntu 11.10 (GNU/Linux 3.0.0-12-generic i686)
....................................
If a similar kind of output is given for 'ssh master', we can proceed to the next steps.
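If several slaves are present, distributing the key can also be scripted instead of running ssh-copy-id once per machine; a minimal sketch, assuming the slave host names are the ones listed in '/etc/hosts':
hpuser@master:~$ for host in slave slave01 slave02; do ssh-copy-id -i $HOME/.ssh/id_rsa.pub hpuser@$host; done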
Hadoop Configurations
We have to do the following modifications in the configuration files.
In the master machine
1. conf/masters
master
This file defines the nodes on which secondary NameNodes start when bin/start-dfs.sh is run. The duty of the secondary NameNode is to merge the edit logs periodically and to keep the edit log size within a limit.
2. conf/slaves
master
slave
This file lists the hosts that act as slaves, processing and storing data. As we have just two nodes, we use the storage of the master too.
Note: If more slaves are present, they should be listed in this file on all the machines.
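For example, a conf/slaves file for a cluster with the two additional slaves mentioned earlier (host names assumed to match '/etc/hosts') would simply read:
master
slave
slave01
slave02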
In all machines
1. conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description> ..... </description>
</property>
We change 'localhost' to master, as we now specifically state that master is to be used as the NameNode.
2. conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description> The host and port that the MapReduce job tracker runs at. </description>
</property>
We change 'localhost' to master, as we now specifically state that master is to be used as the JobTracker.
3. conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
<description> Default number of block replications. </description>
</property>
It is recommended to keep the replication factor no higher than the number of nodes. Here we set it to 2.
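As a side note, once the cluster is up and holds some data, the replication actually applied to the stored blocks can be verified with the fsck utility; a minimal check run from the master could look like this:
hpuser@master:~/hadoop-1.0.3$ bin/hadoop fsck /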
Format HDFS from the NameNode
Initially we need to format HDFS, as we did in the single node set-up too.
hpuser@master:~/hadoop-1.0.3$ bin/hadoop namenode -format
........................................
12/11/02 23:25:54 INFO common.Storage: Storage directory /home/hpuser/temp/dfs/name has been successfully formatted.
12/11/02 23:25:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
If the output of the command ends as above, we are done with formatting the file system and are ready to run the cluster.
Running the Multi Node Cluster
Starting the cluster is done in an order such that the HDFS daemons (NameNode, DataNode) are started first and the Map-reduce daemons (JobTracker, TaskTracker) next. It is also worth noticing that we can observe what is going on in the slaves, when we run commands on the master, from the logs directory inside the HADOOP_HOME of the slaves.
1. Start HDFS daemons - bin/start-dfs.sh in master
hpuser@master:~/hadoop-1.0.3$ bin/start-dfs.sh
starting namenode, logging to .../bin/../logs/hadoop-hpuser-namenode-master.out
slave: Ubuntu 11.10
slave: starting datanode, logging to .../bin/../logs/hadoop-hpuser-datanode-slave.out
master: starting datanode, logging to .../bin/../logs/hadoop-hpuser-datanode-master.out
master: starting secondarynamenode, logging to .../bin/../logs/hadoop-hpuser-secondarynamenode-master.out
This will get HDFS up, with the NameNode and the DataNodes listed in conf/slaves. At this moment, the Java processes running on the master and the slave will be as follows.
hpuser@master:~/hadoop-1.0.3$ jps
5799 NameNode
6614 Jps
5980 DataNode
6177 SecondaryNameNode
hpuser@slave:~/hadoop-1.0.3$ jps
6183 DataNode
5916 Jps
2. Start Map-reduce daemons - bin/start-mapred.sh in master
hpuser@master:~/hadoop-1.0.3$ bin/start-mapred.sh
starting jobtracker, logging to .../bin/../logs/hadoop-hpuser-jobtracker-master.out
slave: Ubuntu 11.10
slave: starting tasktracker, logging to .../bin/../logs/hadoop-hpuser-tasktracker-slave.out
master: starting tasktracker, logging to .../bin/../logs/hadoop-hpuser-tasktracker-master.out
Now jps on the master will show TaskTracker and JobTracker as running Java processes, in addition to the previously observed ones. On the slave, jps will additionally show a TaskTracker.
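As an optional check at this point, we can ask the NameNode for a report of the DataNodes it sees; the command below (output omitted here) should list both master and slave as live nodes. The web interfaces, by default at http://master:50070 for HDFS and http://master:50030 for the JobTracker, show the same information.
hpuser@master:~/hadoop-1.0.3$ bin/hadoop dfsadmin -report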
Stopping the cluster is done in the reverse order of the start. So first the Map-reduce daemons are stopped with bin/stop-mapred.sh on the master, and then bin/stop-dfs.sh is executed from the master.
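For reference, the two commands, run from the master in that order, are:
hpuser@master:~/hadoop-1.0.3$ bin/stop-mapred.sh
hpuser@master:~/hadoop-1.0.3$ bin/stop-dfs.sh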
Now that we know how to start and stop a Hadoop multi node cluster, it's time to get some work done.
Running a Map-reduce Job
This is identical to the steps followed in the single node set-up, but we can use a much larger volume of data as input since we are running on a cluster.
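Before running the job, the input files have to be placed in HDFS. A minimal sketch, assuming some local text files under /home/hpuser/books (both the local path and the HDFS path are only examples):
hpuser@master:~/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/hpuser/testHadoop
hpuser@master:~/hadoop-1.0.3$ bin/hadoop fs -put /home/hpuser/books/*.txt /user/hpuser/testHadoop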
hpuser@master:~/hadoop-1.0.3$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hpuser/testHadoop /user/hpuser/testHadoop-output
The above command gives output similar to that of the single node set-up. In addition, we can observe from the logs how each slave has completed its map tasks and how the final results have been reduced.
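The results can be inspected directly from HDFS once the job finishes; for example (the exact part file names may differ):
hpuser@master:~/hadoop-1.0.3$ bin/hadoop fs -ls /user/hpuser/testHadoop-output
hpuser@master:~/hadoop-1.0.3$ bin/hadoop fs -cat /user/hpuser/testHadoop-output/part*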
Now we have executed a map-reduce job in a Hadoop multi-node cluster. Cheers!
The above steps are just the basics. When setting up a cluster for a real scenario, there is a good deal of fine tuning to be done, considering the cluster hardware, data size, etc., along with best practices to be followed that were not stressed in this post.