Wednesday, July 03, 2013

Hadoop Multi Node Set Up

   This post shares the procedure to set up Apache Hadoop in multi-node mode and is a continuation of the post Hadoop Single Node Set-up. The steps given here build a two-node cluster, which can then be expanded to more nodes according to the volume of data. The unique capabilities of Hadoop are best observed when it processes a large volume of data on a multi-node cluster of commodity hardware.
   Before proceeding to the set-up, it is useful to have a general idea of the HDFS (Hadoop Distributed File System) architecture, which is the default data storage for Hadoop, so that we understand the steps we are following and what happens at execution. In brief, it is a master-slave architecture in which the master acts as the NameNode, which manages the file system namespace, and the slaves act as DataNodes, which manage the storage attached to each node. Similarly, for MapReduce there is a JobTracker on the master node and TaskTrackers on the slave nodes.
   This Hadoop document covers setting up Hadoop in a cluster only in brief. Here I am sharing more detailed guidance for the set-up process, with the following line-up.
  • Pre-requisites
  • Hadoop Configurations
  • Running the Multi Node Cluster
  • Running a MapReduce Job

Pre-requisites


   The following steps set up a Hadoop cluster with two Linux machines. Before moving on to the cluster, it is convenient if both machines have already been set up as single nodes, so that we can get to the cluster quickly with minimum modifications and less hassle.
   It is recommended to use the same paths and installation locations on each machine when doing the single-node set-up. This makes life easier during installation as well as when tracking down problems later at execution. For example, if we use the same paths and installation locations on each machine (i.e. hpuser/hadoop/), we can follow all the steps of the single-node set-up procedure on one machine and then copy the folder to the other machines, with no modification needed to the paths.
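   For instance, once the single-node set-up is complete on master and SSH access to slave is in place (see below), the whole installation can be copied across in one step. This is only a sketch, assuming the installation lives at ~/hadoop-1.0.3 on both machines, as in the commands later in this post.

hpuser@master:~$ scp -r ~/hadoop-1.0.3 hpuser@slave:~/

   If the single node was already used for test runs, it may also be worth clearing its temporary HDFS data (the hadoop.tmp.dir location) before joining it to the cluster, so that both nodes start from a clean state.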
  


Single Node Set-up in Each Machine


Networking


  Obviously the two nodes need to be networked so that they can communicate with each other. We can connect them with a network cable or any other option. For what follows we just need the IP addresses of the two machines on the established connection. I have selected 192.168.0.1 as the master machine and 192.168.0.2 as a slave. These entries then need to be added to the '/etc/hosts' file of each machine as follows.
 
192.168.0.1    master
192.168.0.2    slave

 
Note: When more slaves are added, they should be listed here on each machine, with a unique name for every slave (e.g. slave01, slave02).
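A quick way to confirm that the names resolve as intended is to ping each entry from both machines, for example:

hpuser@master:~$ ping -c 3 slave
hpuser@slave:~$ ping -c 3 master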

Enable SSH Access

   We did this step in the single-node set-up of each machine, to create a secured channel between localhost and hpuser. Now we need to make the hpuser account on master capable of connecting to the hpuser account on slave via a password-less SSH login. We can do this by adding the public SSH key of hpuser on master to the authorized_keys of hpuser on slave. The following command, run as hpuser on master, will do the work.
hpuser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hpuser@slave
Note: If more slaves are present, this needs to be repeated for each of them. The command will prompt for the password of hpuser on slave and, once it is given, we are done. To test, we can try to connect from master to master and from master to slave as follows.

hpuser@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is ............................................................
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave (192.168.0.2)' (RSA) to the list of known hosts.
hpuser@slave's password:
Welcome to Ubuntu 11.10 (GNU/Linux 3.0.0-12-generic i686)
....................................
If 'ssh master' gives a similar output, we can proceed to the next steps.
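If ssh-copy-id happens to be unavailable, roughly the same effect can be achieved by appending the key manually. This is just a sketch, assuming the id_rsa key pair generated during the single-node set-up:

hpuser@master:~$ cat $HOME/.ssh/id_rsa.pub | ssh hpuser@slave 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'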

Hadoop Configurations

We have to do the following modifications in the configuration files.

In master machine

1. conf/masters
master 
This file defines the nodes on which the secondary NameNode starts when bin/start-dfs.sh is run. The duty of the secondary NameNode is to merge the edit logs periodically, keeping the edit log size within a limit.

2. conf/slaves

master
slave 
   This file lists the hosts that act as slaves, processing and storing data. As we have only two nodes, we use the storage of master too.
Note: If more slaves are present, they should be listed in this file on all the machines.
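For example, with master also storing data and two dedicated slaves named as in the /etc/hosts note above, conf/slaves would read:

master
slave01
slave02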




In all machines

1. conf/core-site.xml


<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>.....</description>
</property>
We change 'localhost' to master, since we can now specifically tell the cluster to use master as the NameNode.

2. conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.</description>
</property>
We change 'localhost' to master, since we can now specifically tell the cluster to use master as the JobTracker.

3. conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default number of block replications.
  </description>
</property>
It is recommended not to set the replication factor higher than the number of nodes; here we set it to 2.
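Once the cluster is running and holds some data, the replication actually in effect can be checked with fsck from master (a quick sanity check; the output depends on the data stored):

hpuser@master:~/hadoop-1.0.3$ bin/hadoop fsck / -files -blocks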

Format HDFS from the NameNode

Initially we need to format HDFS, just as we did in the single-node set-up.
hpuser@master:~/hadoop-1.0.3$ bin/hadoop namenode -format

........................................
12/11/02 23:25:54 INFO common.Storage: Storage directory /home/hpuser/temp/dfs/name has been successfully formatted.
12/11/02 23:25:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
 
If the output of the command ends as above, we are done formatting the file system and are ready to run the cluster.

Running the Multi Node Cluster

   The cluster is started in a fixed order: first the HDFS daemons (NameNode, DataNode) and then the MapReduce daemons (JobTracker, TaskTracker).
It is also worth noting that, while commands are run on master, we can observe what is going on in the slaves from the logs directory inside the HADOOP_HOME of each slave.
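For example, to watch the DataNode on slave while the start-up commands are issued on master (the log file names follow the hadoop-<user>-<daemon>-<host> pattern visible in the outputs below; the exact name may differ on your machines):

hpuser@slave:~/hadoop-1.0.3$ tail -f logs/hadoop-hpuser-datanode-slave.log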

1. Start HDFS daemons - bin/start-dfs.sh in master

hpuser@master:~/hadoop-1.0.3$ bin/start-dfs.sh
starting namenode, logging to ../bin/../logs/hadoop-hpuser-namenode-master.out
slave: Ubuntu 11.10
slave: starting datanode, logging to .../bin/../logs/hadoop-hpuser-datanode-slave.out
master: starting datanode, logging to .../bin/../logs/hadoop-hpuser-datanode-master.out
master: starting secondarynamenode, logging to .../bin/../logs/hadoop-hpuser-secondarynamenode-master.out
This brings HDFS up, with the NameNode and the DataNodes listed in conf/slaves.
At this moment, the Java processes running on master and slave will be as follows.

hpuser@master:~/hadoop-1.0.3$ jps
5799 NameNode
6614 Jps
5980 DataNode
6177 SecondaryNameNode

hpuser@slave:~/hadoop-1.0.3$ jps
6183 DataNode
5916 Jps
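Besides jps, a cluster-wide view of the DataNodes that have registered with the NameNode can be obtained from master; the report lists the live nodes and their capacities.

hpuser@master:~/hadoop-1.0.3$ bin/hadoop dfsadmin -report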
2. Start MapReduce daemons - bin/start-mapred.sh in master

hpuser@master:~/hadoop-1.0.3$ bin/start-mapred.sh
starting jobtracker, logging to .../bin/../logs/hadoop-hpuser-jobtracker-master.out
slave: Ubuntu 11.10
slave: starting tasktracker, logging to ../bin/../logs/hadoop-hpuser-tasktracker-slave.out
master: starting tasktracker, logging to .../bin/../logs/hadoop-hpuser-tasktracker-master.out
Now jps on master will show JobTracker and TaskTracker as running Java processes, in addition to the previously observed ones. On the slaves, jps will additionally show TaskTracker.
   Stopping the cluster is done in the reverse order of starting it: the MapReduce daemons are stopped first with bin/stop-mapred.sh on master, and then bin/stop-dfs.sh is executed from master.
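In command form, the stop sequence is:

hpuser@master:~/hadoop-1.0.3$ bin/stop-mapred.sh
hpuser@master:~/hadoop-1.0.3$ bin/stop-dfs.sh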
Now we know how to start and stop a Hadoop multi-node cluster, so it is time to get some work done.

Running a MapReduce Job

   This is identical to the steps followed in the single-node set-up, but since we are running on a cluster we can use a much larger volume of data as input.
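The input files first need to be copied into HDFS. As a sketch, assuming the source text files sit in a local directory /tmp/testHadoop (any local path will do):

hpuser@master:~/hadoop-1.0.3$ bin/hadoop dfs -mkdir /user/hpuser/testHadoop
hpuser@master:~/hadoop-1.0.3$ bin/hadoop dfs -copyFromLocal /tmp/testHadoop/*.txt /user/hpuser/testHadoop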

hpuser@master:~/hadoop-1.0.3$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hpuser/testHadoop /user/hpuser/testHadoop-output
   The above command gives output similar to that of the single-node set-up. In addition, from the logs we can observe how each slave completed its map tasks and contributed to reducing the final results. The results can then be read from HDFS as follows.
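The exact part file names can vary, so listing the output directory first shows what was produced:

hpuser@master:~/hadoop-1.0.3$ bin/hadoop dfs -ls /user/hpuser/testHadoop-output
hpuser@master:~/hadoop-1.0.3$ bin/hadoop dfs -cat /user/hpuser/testHadoop-output/part-r-00000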
Now we have executed a map-reduce job in a Hadoop multi-node cluster. Cheers!
  The above steps are just the basics. When setting up a cluster for a real scenario, there is plenty of fine tuning to be done, considering the cluster hardware, data size, etc., along with best practices that were not stressed in this post.

37 comments :

  1. nicely written post, could you please throw in some light about having the secondary name node on a separate machine (not with Master Name node)

  2. Because that is the whole concept behind a map-reduce framework. We need to utilize commodity hardware instead of costly supercomputers to achieve high performance, which basically means a cluster. Of course we can run this on a single machine for testing etc., in pseudo-distributed mode, but that does not actually utilize the power of Hadoop.

  3. Any idea, why we get following error on slave machine after stating hdfs demon on Master machine

    2013-07-24 12:10:59,373 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.0.1:54310. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    2013-07-24 12:11:00,374 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.0.1:54310. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    2013-07-24 12:11:00,377 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/192.168.0.1:54310 failed on local exception: java.net.NoRouteToHostException: No route to host
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144)
    at org.apache.hadoop.ipc.Client.call(Client.java:1112)

    Replies
    1. any advise @ Pushpalanka Jayawardhana

    2. This is some kind of error with your connection. First try and see whether you can ping the master machine from slave

    3. Yes i can connect from master to slave
      [hduser@localhost hadoop]$ ssh slave
      Last login: Wed Jul 24 16:35:21 2013 from 192.168.0.1

    4. i can even ping from master to slave
      [hduser@localhost ~]$ ping 192.168.1.02
      PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
      64 bytes from 192.168.0.2: icmp_req=1 ttl=64 time=0.022 ms
      64 bytes from 192.168.0.2: icmp_req=2 ttl=64 time=0.025 ms
      ^C
      --- 192.168.0.2 ping statistics ---
      2 packets transmitted, 2 received, 0% packet loss, time 999ms
      rtt min/avg/max/mdev = 0.022/0.023/0.025/0.005 ms

    5. its an port issue on master, thanks for your blog , keep going @ Pushpalanka Jayawardhana

    6. @ surya thanuri.. will you plz help me out how you solved this port issue? plz...

  5. How to copy file from HDFS to the local file system @ @ Pushpalanka Jayawardhana . There is no physical location of a file under the file , not even directory

    Replies
    1. solved by Point my web browser to HDFS WEBUI(namenode_machine:50070), browse to the file you intend to copy, scroll down the page and click on download the file.

  6. hi @ Pushpalanka Jayawardhana .
    I am trying to process XML files from hadoop, i got following error on invoking word-count job on XML files .

    13/07/25 12:39:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000008_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:39:58 INFO mapred.JobClient: map 99% reduce 0%
    13/07/25 12:39:59 INFO mapred.JobClient: map 100% reduce 0%
    13/07/25 12:40:56 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000009_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:40:58 INFO mapred.JobClient: map 99% reduce 0%
    13/07/25 12:40:59 INFO mapred.JobClient: map 100% reduce 0%
    13/07/25 12:41:22 INFO mapred.JobClient: map 100% reduce 1%
    13/07/25 12:41:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000015_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:41:58 INFO mapred.JobClient: map 99% reduce 1%
    13/07/25 12:41:59 INFO mapred.JobClient: map 100% reduce 1%
    13/07/25 12:42:57 INFO mapred.JobClient: Task Id : attempt_201307251234_0001_m_000014_0, Status : FAILED
    Too many fetch-failures
    13/07/25 12:42:58 INFO mapred.JobClient: map 99% reduce 1%
    13/07/25 12:42:59 INFO mapred.JobClient: map 100% reduce 1%
    13/07/25 12:43:22 INFO mapred.JobClient: map 100% reduce 2%


    i observer following error at hadoop-hduser-tasktracker-localhost.localdomain.log file on slave machine .


    2013-07-25 12:38:58,124 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201307251234_0001_m_000001_0,0) failed :
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/hduser/jobcache/job_201307251234_0001/attempt_201307251234_0001_m_000001_0/output/file.out.index in any of the configured local directories
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)


    This works fine when i ran for text files .


    Can you advise me

  9. Is there any possible way to run two slave running one 1 system pointing to two different master.

    Thanks,
    Vadiraj

    Replies
    1. Hi Vadiraj,

      If you have two network interfaces this should be easily possible, by configuring the two slaves to communicate with the masters through separate interfaces. Otherwise what Ahmad has suggested below may give some hints.

      Regards,
      Pushpalanka

  10. There is possibility to have two namenodes in your cluster, one would be in active and other will be in passive mode (hotstandby). Check this in following link:

    http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

  11. Thanks you very much Ahmad , i was thinking to run 2 slave service(acts as a secondary name node)on one system which in turn each pointing it to 2 different master name node cluster.

  12. Pushpalanka Jayawardhana @ I forget to see your comments…thank you very much for your help!

  13. Hi Pushpalanka Jayawardhana
    I am trying to create multi node cluster, when starting HDFS daemon all nodes except datanode get started. In both master and slave datanode alone is nt getting started. Can u suggest some ideas.


    The datanode logs look like tis
    2013-09-20 02:48:15,725 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting DataNode
    STARTUP_MSG: host = java.net.UnknownHostException: savitha: savitha
    STARTUP_MSG: args = []
    STARTUP_MSG: version = 0.20.2
    STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    2013-09-20 02:48:15,906 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.UnknownHostException: savitha: savitha
    at java.net.InetAddress.getLocalHost(InetAddress.java:1402)
    at org.apache.hadoop.net.DNS.getDefaultHost(DNS.java:185)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:242)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

    2013-09-20 02:48:15,908 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down DataNode at java.net.UnknownHostException: savitha: savitha
    ************************************************************/

    Replies
    1. Hi Savitha,

      It seems an issue in resolving the host name to IP address. Did you update the /etc/hosts file according to your set up correctly? I guess there must be an entry like, savitha.

      Regards,
      Pushpalanka

  14. Hi,

    I am using hadoop2.1. There are two issues otherwise it works fine I am able to put files and run jobs. Following are the issues:

    1. Warning "Unable to load native-hadoop library for your platform".
    2. Secondary namenodes ip is 0.0.0.0 I do not know where to configure this.

    Following is output when starting dfs:

    [hduser@masterserver01 hadoop]$ sbin/start-dfs.sh
    13/09/29 18:35:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Starting namenodes on [masterserver01]
    masterserver01: starting namenode, logging to /opt/hadoop/hadoop/logs/hadoop-hduser-namenode-masterserver01.out
    slaveserver201: starting datanode, logging to /opt/hadoop/hadoop/logs/hadoop-hduser-datanode-slaveserver201.out
    masterserver01: starting datanode, logging to /opt/hadoop/hadoop/logs/hadoop-hduser-datanode-masterserver01.out
    Starting secondary namenodes [0.0.0.0]
    The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
    RSA key fingerprint is c9:d8:39:e9:99:95:16:28:2c:ad:8b:f7:dc:c9:de:75.
    Are you sure you want to continue connecting (yes/no)?

    Best Regards
    Ahmad

  18. I have 3 slaves how can i edit that /etc/hosts file? can anyone plz explain?

    Replies
    1. 192.168.0.1 master
      192.168.0.2 slave1
      192.168.0.3 slave2 ...
      It just needs to be a host name of your preference.

  20. i really thanks for your blog and well answers.... :)
    i have one question after setting the cluster with 3 nodes its working fine for me.... felling very happy.
    Now i want to instal the Hive,HBase,Pig,Oozie. Do i need to install in all the machines along with NameNode?
    or if only Name Node what other changes i need to do for data Nodes.
    thanks for your help... :)

  21. Hi, great tutorial. Thanks !!
    I have successfully created a cluster with two nodes. I am using hadoop 1.2.1. Also i am able to ping slave from the master. The only problem which i am facing is that, after i issue the command to start hdfs, it takes a lot of time to set up datanode on the slave and finally it doesnt happen. then i hv to forcefully stop the script. Please suggest as to why this is happening.

    Thanks in advance

  22. is it possible to set up multi node cluster in wireless network????

  23. can't up namenode in multicluster

  24. Can you please update this very very helpful blog to hadoop version 1.7.2. Regards
    and Thanks for the excellent blog.

  25. I meant to say Hadoop Version 2.7.2. Sorry that was a typo.
