In this post I hope to share the procedure for setting up Apache Hadoop as a multi-node cluster; it is a continuation of the post Hadoop Single Node Set-up. The steps given here set up a two-node cluster, which can then be expanded to more nodes according to the volume of data. The unique capabilities of Hadoop are best observed when it operates on a big volume of data across a multi-node cluster of commodity hardware.
Before proceeding to the set-up, it is useful to have a general idea of the architecture of HDFS (Hadoop Distributed File System), the default data storage for Hadoop, so that we understand the steps we are following and what is happening at execution. In brief, it is a master-slave architecture: the master acts as the NameNode, which manages the file system namespace, and the slaves act as DataNodes, which manage the storage attached to each node. Similarly, for Map-reduce, the JobTracker is a master daemon and the TaskTrackers are slave daemons.
This Hadoop document briefly covers setting up Hadoop in a cluster. Here I am sharing detailed guidance for the set-up process, with the following line-up.
- Hadoop Configurations
- Running the Multi Node Cluster
- Running a Map Reduce Job
The following steps set up a Hadoop cluster with two Linux machines. Before proceeding to the cluster, it is convenient if both machines have already been set up as single nodes, so that we can quickly move to the cluster with minimum modifications and less hassle.
It is recommended to follow the same paths and installation locations on each machine when setting up the single node. This makes our lives easier during installation, as well as when tracking down problems later at execution. For example, if we follow the same paths and installation on each machine (e.g. hpuser/hadoop/), we can follow all the steps of the single-node set-up procedure on one machine, and if the folder is then copied to the other machines, no modification of the paths is needed.
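For instance, once the single node is configured on one machine, the whole installation folder can be copied across. A minimal sketch, assuming the hpuser account and a hadoop-1.0.3 folder in its home directory on both machines; the `echo` makes this a dry run that only prints the command, so remove it to actually copy:

```shell
# Dry-run preview of copying the configured Hadoop folder to each slave.
# 'hpuser', 'slave' and the hadoop-1.0.3 path are this set-up's assumptions.
for host in slave; do
  echo scp -r "$HOME/hadoop-1.0.3" "hpuser@$host:"
done
```

Adding more host names to the `for` list covers additional slaves with the same command.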
Single Node Set-up in Each Machine
Obviously, the two nodes need to be networked so that they can communicate with each other. We can connect them via a network cable or any other option. For what follows, we just need the IP addresses of the two machines on the established connection. I have selected 192.168.0.1 as the master machine and 192.168.0.2 as a slave. Then we need to add these to the '/etc/hosts' file of each machine as follows.
```
192.168.0.1 master
192.168.0.2 slave
```
Note: When more slaves are added, they should be listed here in each machine, using unique names for the slaves (e.g. slave01, slave02).
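As the cluster grows, these entries can be generated rather than typed by hand. A small sketch, where all the IPs are illustrative examples to be substituted with your machines' addresses:

```shell
# Emit /etc/hosts entries for a master and a list of slaves.
# All IPs here are illustrative examples, not real cluster addresses.
MASTER_IP="192.168.0.1"
SLAVE_IPS="192.168.0.2 192.168.0.3"

printf '%s master\n' "$MASTER_IP"
i=1
for ip in $SLAVE_IPS; do
  printf '%s slave%02d\n' "$ip" "$i"
  i=$((i + 1))
done
```

Appending this output to /etc/hosts on every node keeps the naming consistent across the cluster.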
Enable SSH Access
We did this step in the single-node set-up of each machine, to create a secure channel between localhost and hpuser. Now we need to make hpuser on the master capable of connecting to the hpuser account on the slave via a password-less SSH login. We can do this by adding the public SSH key of hpuser on the master to the authorized_keys of hpuser on the slave. The following command, run as hpuser on the master, will do the work.
```
hpuser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hpuser@slave
```

Note: If more slaves are present, this needs to be repeated for each of them.

This will prompt for the password of hpuser on the slave, and once it is given we are done. To test, we can try to connect from master to master and from master to slave as follows.

```
hpuser@master:~$ ssh slave
The authenticity of host 'slave (192.168.0.2)' can't be established.
RSA key fingerprint is ..............................
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave (192.168.0.2)' (RSA) to the list of known hosts.
hpuser@slave's password:
Welcome to Ubuntu 11.10 (GNU/Linux 3.0.0-12-generic i686)
..............................
......
```

If a similar kind of output is given for 'ssh master', we can proceed to the next steps.
We have to make the following modifications to the configuration files.
In master machine
conf/masters

This file defines the nodes on which the secondary NameNode starts when bin/start-dfs.sh is run. The duty of the secondary NameNode is to merge the edit logs periodically, keeping the edit log size within a limit. Here it contains just the master:

```
master
```

conf/slaves

This file lists the hosts that act as slaves, processing and storing data. As we have just two nodes, we are using the storage of the master too:

```
master
slave
```

Note: If more slaves are present, they should be listed in this file on all the machines.
In all machines
In conf/core-site.xml we change 'localhost' to master, as we can now specifically mention to use the master as the NameNode.

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>.....</description>
</property>
```
In conf/mapred-site.xml we change 'localhost' to master, as we can now specifically mention to use the master as the JobTracker.

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
```
In conf/hdfs-site.xml we set the replication factor. It is recommended to keep the replication factor not above the number of nodes; here we set it to 2.

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default number of block replications.</description>
</property>
```
Format HDFS from the NameNode
Initially, we need to format HDFS, as we did in the single-node set-up.
```
hpuser@master:~/hadoop-1.0.3$ bin/hadoop namenode -format
..............................
..........
12/11/02 23:25:54 INFO common.Storage: Storage directory /home/hpuser/temp/dfs/name has been successfully formatted.
12/11/02 23:25:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
```

If the output of the command ends as above, we are done with formatting the file system and ready to run the cluster.
Running the Multi Node Cluster
Starting the cluster is done in order: first the HDFS daemons (NameNode, DataNodes) are started, and then the Map-reduce daemons (JobTracker, TaskTrackers).
It is also worth noticing that, when we run commands on the master, we can observe what is going on in the slaves from the logs directory inside the HADOOP_HOME of each slave.
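For example, a slave's daemon log can be inspected over SSH from the master. A dry-run sketch, assuming the default hadoop-&lt;user&gt;-&lt;daemon&gt;-&lt;host&gt;.log naming and the same install path on every node; the `echo` only prints the command, so remove it to actually run:

```shell
# Dry-run preview: read the tail of each slave's DataNode log from the master.
# Host name, user and path are this set-up's assumptions; adjust to your cluster.
for host in slave; do
  echo ssh "hpuser@$host" tail -n 20 "hadoop-1.0.3/logs/hadoop-hpuser-datanode-$host.log"
done
```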
1. Start HDFS daemons - bin/start-dfs.sh in master
This will get HDFS up, with the NameNode and the DataNodes listed in conf/slaves.

```
hpuser@master:~/hadoop-1.0.3$ bin/start-dfs.sh
starting namenode, logging to .../bin/../logs/hadoop-hpuser-namenode-master.out
slave: Ubuntu 11.10
slave: starting datanode, logging to .../bin/../logs/hadoop-hpuser-datanode-slave.out
master: starting datanode, logging to .../bin/../logs/hadoop-hpuser-datanode-master.out
master: starting secondarynamenode, logging to .../bin/../logs/hadoop-hpuser-secondarynamenode-master.out
```
At this moment, java processes running on master and slaves will be as follows.
```
hpuser@master:~/hadoop-1.0.3$ jps
5799 NameNode
6614 Jps
5980 DataNode
6177 SecondaryNameNode
```

```
hpuser@slave:~/hadoop-1.0.3$ jps
6183 DataNode
5916 Jps
```

2. Start Map-reduce daemons - bin/start-mapred.sh in master
```
hpuser@master:~/hadoop-1.0.3$ bin/start-mapred.sh
starting jobtracker, logging to .../bin/../logs/hadoop-hpuser-jobtracker-master.out
slave: Ubuntu 11.10
slave: starting tasktracker, logging to .../bin/../logs/hadoop-hpuser-tasktracker-slave.out
master: starting tasktracker, logging to .../bin/../logs/hadoop-hpuser-tasktracker-master.out
```

Now jps at the master will show TaskTracker and JobTracker as running Java processes, in addition to the previously observed processes. At the slaves, jps will additionally show a TaskTracker.
Stopping the cluster is done in the reverse order of the start: first the Map-reduce daemons are stopped with bin/stop-mapred.sh in the master, and then bin/stop-dfs.sh is executed from the master.
Now we know how to start and stop a Hadoop multi-node cluster. It's time to get some work done.
Running a Map-reduce Job
This is identical to the steps followed in the single-node set-up, but we can use a much larger volume of data as input since we are running on a cluster.
```
hpuser@master:~/hadoop-1.0.3$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hpuser/testHadoop /user/hpuser/testHadoop-output
```

The above command gives output similar to the single-node set-up. In addition, from the logs we can observe how each slave has completed map tasks and reduced the final results.
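As a side note, what the wordcount job computes can be sanity-checked locally on a tiny sample with standard tools. A sketch, where the sample file and its contents are made up for illustration:

```shell
# Count word occurrences in a small local sample, mirroring what the
# wordcount job does over its HDFS input (illustrative file only).
printf 'hadoop cluster\nhadoop node\n' > /tmp/wc-sample.txt
tr -s ' ' '\n' < /tmp/wc-sample.txt | sort | uniq -c | sort -rn
```

Each output line is a count followed by a word ('hadoop' appears twice here), which mirrors the (word, count) pairs the job writes to its output directory on HDFS.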
Now we have executed a map-reduce job in a Hadoop multi-node cluster. Cheers!
The above steps are just the basics. When setting up a cluster for a real scenario, there is fine tuning to be done, considering the cluster hardware, data size, etc., and there are best practices to follow that were not stressed in this post.