The following steps set up a Hadoop cluster on two Linux machines. Before proceeding to the cluster, it is convenient if both machines have already been set up as single nodes, so that we can quickly move to the cluster with minimal modifications and less hassle.
It is recommended to follow the same paths and installation locations on each machine when setting up the single-node cluster. This makes the installation easier and helps in tracking down any problems later at execution time. For example, if we follow the same paths and installation on each machine (i.e. hpuser/hadoop/), we can simply follow all the steps of the single-node set-up procedure on one machine, and if the folder is copied to the other machines, no modification to the paths is needed.
1.1 Single Node Set-up in Each Machine
Obviously the two nodes need to be networked so that they can communicate with each other. We can connect them via a network cable or any other option. For what follows we just need the IP addresses of the two machines on the established connection. I have selected 192.168.0.1 as the master machine and 192.168.0.2 as a slave. Then we need to add these to the '/etc/hosts' file of each machine as follows.
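For example, with the IP addresses above, the entries added to '/etc/hosts' on both machines would look like this (master and slave are the host names used throughout this guide):

```
192.168.0.1    master
192.168.0.2    slave
```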
Note: The addition of more slaves should be updated here in each machine using unique names for slaves (eg: slave01, slave02).
1.2 Enable SSH Access
We did this step in the single-node set-up of each machine to create a secure channel between localhost and hpuser. Now we need to make hpuser on master capable of connecting to the hpuser account on slave via a password-less SSH login. We can do this by adding the public SSH key of hpuser on master to the authorized_keys of hpuser on slave. The following command, run as hpuser on master, will do the work.
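A minimal sketch of that command, assuming the key pair generated during the single-node set-up sits at the default location:

```shell
# Append master's public key to slave's authorized_keys.
# ssh-copy-id ships with OpenSSH; it prompts for hpuser@slave's password once.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hpuser@slave
```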
Note: If more slaves are present, this step needs to be repeated for each of them. The command will prompt for the password of hpuser on slave, and once it is given we are done. To test, we can try to connect from master to master and from master to slave, as per our requirement, as follows.
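The connection tests can be run as hpuser on master like this (the first connection to each host will ask to confirm the host's fingerprint):

```shell
# Both logins should now succeed without asking for a password.
ssh master
exit
ssh slave
exit
```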
If a similar output is given for 'ssh master', we can proceed to the next steps.
2. Hadoop Configurations
We have to do the following modifications in the configuration files.
In master machine
This file (conf/masters) defines the nodes on which secondary NameNodes start when bin/start-dfs.sh is run. The duty of the secondary NameNode is to merge the edit logs periodically, keeping the edit log size within a limit.
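With our two-machine layout, conf/masters on the master machine would simply contain:

```
master
```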
This file (conf/slaves) lists the hosts that act as slaves, processing and storing data. As we have just two nodes, we are using the storage of master too.
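So in our case, conf/slaves lists both machines, one host name per line:

```
master
slave
```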
Note: If more slaves are present, they should be listed in this file on all the machines.
In all machines
We are changing 'localhost' to master, as we can now specifically mention that master is to be used as the NameNode.
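In conf/core-site.xml this amounts to a property along these lines (port 9000 is an assumption here; keep whatever port your single-node set-up already used):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
```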
We are changing 'localhost' to master, as we can now specifically mention that master is to be used as the JobTracker.
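In conf/mapred-site.xml this is a property like the following (port 9001 is an assumption; retain the port from your single-node set-up):

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>
```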
It is recommended not to set the replication factor above the number of nodes. Here we are setting it to 2.
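In conf/hdfs-site.xml this looks like:

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```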
Format HDFS from the NameNode
Initially we need to format HDFS, as we did in the single-node set-up.
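Formatting is done from the master (assuming we are inside HADOOP_HOME):

```shell
# Run once from the master; reformatting an existing cluster erases HDFS data.
bin/hadoop namenode -format
```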
If the output of the command ends with a message saying the storage directory has been successfully formatted, we are done with formatting the file system and ready to run the cluster.
Running the Multi Node Cluster
Starting the cluster is done in order: first the HDFS daemons (NameNode, DataNode) are started, and then the Map-reduce daemons (JobTracker, TaskTracker).
It is also worth noticing that, when we run commands on master, we can observe what is going on in the slaves from the logs directory inside the HADOOP_HOME of each slave.
1. Start HDFS daemons - bin/start-dfs.sh in master
This will bring up HDFS with the NameNode and the DataNodes listed in conf/slaves.
At this moment, java processes running on master and slaves will be as follows.
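For example, jps should report processes along these lines (the process IDs are placeholders and will differ; this assumes master also runs a DataNode, as listed in conf/slaves):

```shell
# On master
$ jps
14799 NameNode
15183 DataNode
15596 SecondaryNameNode
16129 Jps

# On slave
$ jps
15213 DataNode
16044 Jps
```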
2. Start Map-reduce daemons - bin/start-mapred.sh in master
Now jps on master will show JobTracker and TaskTracker as running Java processes, in addition to the previously observed ones. On the slaves, jps will additionally show a TaskTracker.
Stopping the cluster is done in the reverse order of the start: first the Map-reduce daemons are stopped with bin/stop-mapred.sh on master, and then bin/stop-dfs.sh is executed, also from master.
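The full start/stop cycle, run from master inside HADOOP_HOME, can be sketched as:

```shell
# Start: HDFS daemons first, then Map-reduce daemons
bin/start-dfs.sh
bin/start-mapred.sh

# Stop: reverse order
bin/stop-mapred.sh
bin/stop-dfs.sh
```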
Now we know how to start and stop a Hadoop multi node cluster. Now it's time to get some work done.
Running a Map-reduce Job
This is identical to the steps followed in the single-node set-up, but we can use a much larger volume of data as input, as we are running on a cluster.
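As a concrete sketch (assuming the bundled wordcount example used in many single-node guides, and the paths /user/hpuser/input and /user/hpuser/output, which are illustrative):

```shell
# Copy local input into HDFS, then run the bundled wordcount example.
bin/hadoop dfs -copyFromLocal /tmp/input /user/hpuser/input
bin/hadoop jar hadoop-examples-*.jar wordcount /user/hpuser/input /user/hpuser/output
```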
The above command gives a similar output to that of the single-node set-up. In addition, we can observe from the logs how each slave has completed map tasks and reduced the final results.
Now we have executed a map-reduce job in a Hadoop multi-node cluster.