Wednesday, July 03, 2013

Hadoop Multi Node Set Up

   With this post I am hoping to share the procedure to set up Apache Hadoop in multi node and is a continuation of the post, Hadoop Single Node Set-up. The given steps are to set up a two node cluster which can be then expanded to more nodes according to the volume of data. The unique capabilities of Hadoop can be well observed when performing on a BIG volume of data in a multi node cluster of commodity hardware.
   It will be useful to have a general idea on the HDFS(Hadoop Distributed File System) architecture which is the default data storage for Hadoop, before proceed to the set up, that we can well understand the steps we are following and what is happening at execution. In brief, it is a master-slave architecture where master act as the NameNode which manages file system namespace and slaves act as the DataNodes which manage the storage of each node. Also there are JobTrackers which are master nodes and TaskTrackers which are slave nodes.
   This Hadoop document includes the details for setting up Hadoop in a cluster, in brief. I am here sharing a detailed guidance for the set up process with the following line up.
  • Pre-requisites
  • Hadoop Configurations
  • Running the Multi Node Cluster
  • Running a Map Reduce Job

Pre-requisites


   The following steps are to set up a Hadoop cluster with two linux machines. Before proceed to the cluster it is convenient if both the machines have already set-up for the single node, that we can quickly go for the cluster with minimum modifications and less hassle. 
   It is recommended that we follow the same paths and installation locations in each machine, when setting up the single node cluster. This will make our lives easy in installation as well as in catching up any problems later at execution. For example if we follow same paths and installation in each machine (i.e. hpuser/hadoop/), we can just follow all the steps in single-node set-up procedure for one machine and if the folder is copied to other machines, no modification is needed to the paths.
  


Single Node Set-up in Each Machine


Networking


  Obviously the two nodes needs to be networked that they can communicate with each other. We can connect them via a network cable or any other options. For the proceedings we just need the IP addresses of the two machines in the established connection. I have selected 192.168.0.1 as the master machine and 192.168.0.2 as a slave. Then we need to add these in '/etc/hosts' file of each machine as follows.
 
 192.168 .  0.1     master
  192.168 .  0.2     slave 

 
Note: The addition of more slaves should be updated here in each machine using unique names for slaves (eg: slave01, slave02).

Enable SSH Access

   We did this step in single node set up for each machine to create a secured channel between the localhost and hpuser. Now we need to make the hpuser in master, is capable of connecting to the hpuser account in slave via a password-less SSH login. We can do this by adding the public SSH key of hpuser in master to the authorized_keys of hpuser in slave. Following command from hpuser at master will do the work .
hpuser@master :~ $ ssh - copy - id  - i $HOME /. ssh / id_rsa . pub hpuser@slave 
Note: If more slaves are present this needs to be repeated for them. This will prompt for the password of hpuser of slave and once given we are done. To test we can try to connect from master to master and master to slave as per our requirement as follows.

hpuser@master :~ $ ssh slave
 The  authenticity of host  'slave (192.168.0.2)'  can 't be established.
RSA key fingerprint is ............................................................
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ' slave  (  192.168 .  0.2 ) ' (RSA) to the list of known hosts.
hpuser@slave' s password :  
 Welcome  to  Ubuntu    11.10   ( GNU / Linux    3.0 .  0 -  12 - generic i686 ) 
 .................................... 
If a similar kind of output is given for 'ssh master' we can proceed to next steps.

Hadoop Configurations

We have to do the following modifications in the configuration files.

In master machine

1. conf/masters
master 
This file is defining in which nodes are the secondary NameNodes are starting, when bin/start-dfs.sh is run. The duty of secondary NameNode is to merge the edit logs periodically and keeping the edit log size within a limit.

2. conf/slaves

master
slave 
   This file lists the hosts that act as slaves processing and storing data. As we are just having two nodes we are using the storage of master too.
Note: If more slaves are present those should be listed in this file of all the machines.




In all machines

1. conf/core-site.xml


<property> 
   <name> fs.default.name </name> 
   <value> hdfs://master:54310 </value> 
   <description> .....  </description> 
 </property> 
We are changing the 'localhost' to master as we can now specifically mention to use master as NameNode.

2. conf/mapred-site.xml
<property> 
   <name> mapred.job.tracker </name> 
   <value> master:54311 </value> 
   <description> The host and port that the MapReduce job tracker runs
  at.  </description> 
 </property> 
We are changing the 'localhost' to master as we can now specifically mention to use master as JobTracker.

3. conf/hdfs-site.xml
<property> 
   <name> dfs.replication </name> 
   <value> 2 </value> 
   <description> Default number of block replications.
    </description> 
 </property> 
It is recommended to keep the replication factor not above the number of nodes. We are here setting it to 2.

Format HDFS from the NameNode

Initially we need to format HDFS as we did in the single node set up too.
hpuser@master :~/ hadoop -  1.0 .  3 $ bin / hadoop namenode  - format

 ........................................ 
  12 /  11 /  02    23 :  25 :  54  INFO common . Storage :   Storage  directory  / home / hpuser / temp / dfs / name has been successfully formatted . 
  12 /  11 /  02    23 :  25 :  54  INFO namenode . NameNode :  SHUTDOWN_MSG :  
 /************************************************************
 
If the output for the command ended up as above we are done with formatting the file system and ready to run the cluster.

Running the Multi Node Cluster

   Starting the cluster is done in an order that first starts the HDFS daemons(NameNode, Datanode) and then the Map-reduce daemons(JobTracker, TaskTracker). 
Also it is worth to notice that we can observe what is going on in slaves when we run commands in master from the logs directory inside the HADOOP_HOME of the slaves.

1. Start HDFS daemons - bin/start-dfs.sh in master

 hpuser@master :~/ hadoop -  1.0 .  3  $ bin / start - dfs . sh
starting namenode ,  logging to  ../ bin /../ logs / hadoop - hpuser - namenode - master . out 
slave :   Ubuntu    11.10 
slave :  starting datanode ,  logging to  .../ bin /../ logs / hadoop - hpuser - datanode - slave . out 
master :  starting datanode ,  logging to  ..../ bin /../ logs / hadoop - hpuser - datanode - master . out 
master :  starting secondarynamenode ,  logging to  .../ bin /../ logs / hadoop - hpuser - secondarynamenode - master . out 
This will get the HDFS up with NameNode and DataNodes listed in conf/slaves.
At this moment, java processes running on master and slaves will be as follows.

hpuser@master :~/ hadoop -  1.0 .  3 $ jps
  5799   NameNode 
  6614   Jps 
  5980   DataNode 
  6177   SecondaryNameNode 
hpuser@slave :~/ hadoop -  1.0 .  3 $ jps
  6183   DataNode 
  5916   Jps 
2. Start Map-reduce daemons - bin/start-mapred.sh in master

hpuser@master :~/ hadoop -  1.0 .  3 $ bin / start - mapred . sh
starting jobtracker ,  logging to  .../ bin /../ logs / hadoop - hpuser - jobtracker - master . out 
slave :   Ubuntu    11.10 
slave :  starting tasktracker ,  logging to  ../ bin /../ logs / hadoop - hpuser - tasktracker - slave . out 
master :  starting tasktracker ,  logging to  .../ bin /../ logs / hadoop - hpuser - tasktracker - master . out 
Now jps at master will show up TaskTracker and JobTracker as running Java processes in addition to the previously observed processes. At slaves jps will additionally show TaskTracker.
   Stopping the cluster is done in the reverse order as of the start. So first Map-reduce daemons are stopped with bin/stop-mapred.sh in master and then bin/stop-dfs.sh should be executed from the master.
Now we know how to start and stop a Hadoop multi node cluster. Now it's time to get some work done.

Running a Map-reduce Job

   This is identical to the steps followed in single node set-up, but we can use a much larger volume of data as inputs as we are running in a cluster.  

hpuser@master :~/ hadoop -  1.0 .  3 $ bin / hadoop jar hadoop * examples *. jar wordcount  / user / hpuser / testHadoop  / user / hpuser / testHadoop - output 
   The above command give a similar output as of single node set-up. In addition we can observe how each slave have completed mapped tasks and reduced the final results from the logs.
Now we have executed a map-reduce job in a Hadoop multi-node cluster. Cheers!
  The above steps are just the basics. When setting up a cluster for a real scenario, there are several fine tuning that needs to be done, considering the cluster hardware, data size etc. and the best practices to be followed that were not stressed in this post.