
How to Set Up a Hadoop Multi-Node Cluster on Ubuntu




In this tutorial, we will learn how to set up a multi-node Hadoop cluster on Ubuntu 16.04. A Hadoop cluster with more than one DataNode is a multi-node cluster, so the goal of this tutorial is to get two DataNodes up and running.

1) Prerequisites

  • Ubuntu 16.04
  • Hadoop-2.7.3
  • Java 7
  • SSH
For this tutorial, I have two Ubuntu 16.04 systems; I call them the master and the slave system, and one DataNode will run on each of them.
IP address of Master -> 192.168.1.37
IP address of Slave -> 192.168.1.38
On Master
Edit hosts file with master and slave ip address.
sudo gedit /etc/hosts
Add the two entries shown below; you may remove the other lines in the file. After editing, save the file and close it.
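Using the IP addresses above, the two entries to add are:
192.168.1.37 master
192.168.1.38 slave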
On Slave
Edit hosts file with master and slave ip address.
sudo gedit /etc/hosts
Add the same two entries (master and slave) as on the master machine; you may remove the other lines. After editing, save the file and close it.

2) Java Installation

Before setting up Hadoop, you need Java installed on your systems. Install OpenJDK 7 on both Ubuntu machines using the commands below.
sudo add-apt-repository ppa:openjdk-r/ppa

sudo apt-get update

sudo apt-get install openjdk-7-jdk
Run the command below to check that Java was installed successfully.
java -version
By default, Java is installed under the /usr/lib/jvm/ directory.
ls /usr/lib/jvm
Set the Java path in your .bashrc file by adding the lines below.
sudo gedit .bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-amd64/bin
Run the command below to apply the changes made in the .bashrc file.
source .bashrc
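As a quick sanity check, you can print the variable that was just set:
echo $JAVA_HOME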

3) SSH

Hadoop requires SSH access to manage its nodes, so we need to install SSH on both the master and slave systems.
sudo apt-get install openssh-server
Now we have to generate an SSH key on the master machine. When it asks you for a file name to save the key, do not give any name; just press Enter.
ssh-keygen -t rsa -P ""
Next, enable SSH access to the master machine itself with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Now test the SSH setup by connecting to your local machine.
ssh localhost
Now run the command below to send the public key generated on the master to the slave.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave
Now that both master and slave have the public key, you can connect from the master to itself and from the master to the slave as well.
ssh master
ssh slave
On Master
Edit the masters file as below.
sudo gedit hadoop-2.7.3/etc/hadoop/masters
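In Hadoop 2.x the masters file lists the host on which the SecondaryNameNode runs; for the layout described here it contains a single line:
master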
Edit the slaves file as below.
sudo gedit hadoop-2.7.3/etc/hadoop/slaves
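The slaves file lists the hosts that run the DataNode and NodeManager daemons. Since we want one DataNode on each system, it contains both host names:
master
slave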
On Slave
Edit the masters file as below.
sudo gedit hadoop-2.7.3/etc/hadoop/masters
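As on the master machine, this file contains only the master host name:
master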

4) Hadoop Installation

Now that our Java and SSH setup is ready, we are good to go and install Hadoop on both systems. Use the link below to download the Hadoop package; I am using the stable release Hadoop 2.7.3.
hadoop releases
On Master
The command below will download the hadoop-2.7.3 tar file.
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
ls
Untar the file
tar -xvf hadoop-2.7.3.tar.gz
ls
Confirm that Hadoop was extracted correctly on your system.
cd hadoop-2.7.3/
bin/hadoop version
Before configuring Hadoop itself, we will set the environment variables below in the .bashrc file.
cd
sudo gedit .bashrc
# Set Hadoop-related environment variables 
export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HOME/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.3 
export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.3 
export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.3
export YARN_HOME=$HOME/hadoop-2.7.3

# Add Hadoop bin/ directory to PATH 
export PATH=$PATH:$HOME/hadoop-2.7.3/bin
Put the above lines at the end of your .bashrc file, save the file and close it, then run the command below to reload it.
source .bashrc
Configure JAVA_HOME in ‘hadoop-env.sh’. This file specifies environment variables that affect the JDK used by Apache Hadoop 2.7.3 daemons started by the Hadoop start-up scripts:
cd hadoop-2.7.3/etc/hadoop/
sudo gedit hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 
Set the java path as shown above, save the file and close it.
Now we will create NameNode and DataNode directories.
cd

mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/namenode

mkdir -p $HADOOP_HOME/hadoop2_data/hdfs/datanode
Hadoop has many configuration files, which need to be configured according to the requirements of your Hadoop infrastructure. Let us configure the Hadoop configuration files one by one.
cd hadoop-2.7.3/etc/hadoop/

sudo gedit core-site.xml
Core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>
sudo gedit hdfs-site.xml
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/ubuntu/hadoop-2.7.3/hadoop2_data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/ubuntu/hadoop-2.7.3/hadoop2_data/hdfs/datanode</value>
    </property>
</configuration>
sudo gedit yarn-site.xml
yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
cp mapred-site.xml.template mapred-site.xml

sudo gedit mapred-site.xml
mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Now follow the same Hadoop installation and configuration steps on the slave machine as well. Once Hadoop is installed and configured on both systems, the first step in starting up your Hadoop cluster is formatting the Hadoop file system, which is implemented on top of the local file systems of your cluster. This is required only the first time you set up the cluster. Do not format a running Hadoop file system, as this will erase all your HDFS data.
On Master
cd

cd hadoop-2.7.3/bin

hadoop namenode -format
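If the format succeeds, the output contains a line similar to the following (the exact wording can vary slightly between versions):
INFO common.Storage: Storage directory /home/ubuntu/hadoop-2.7.3/hadoop2_data/hdfs/namenode has been successfully formatted.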
We are now ready to start the Hadoop daemons, i.e. NameNode, DataNode, ResourceManager and NodeManager, on our Apache Hadoop cluster.
cd ..
Now run the command below to start the NameNode on the master machine and the DataNodes on both master and slave.
sbin/start-dfs.sh
The command below will start the YARN daemons; the ResourceManager will run on the master and NodeManagers will run on both master and slave.
sbin/start-yarn.sh
Cross-check that all the services have started correctly using jps (the Java Virtual Machine Process Status tool) on both the master and slave machines.
Below are the daemons running on the master machine.
jps
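The process IDs will differ, but on the master the list should look roughly like this (a SecondaryNameNode is also started because of the masters file):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps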
On Slave
You will see that DataNode and NodeManager are running on the slave machine as well.
jps
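On the slave, the list should look roughly like this:
DataNode
NodeManager
Jps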
Now open your browser on the master machine and go to the URL below.
Check the NameNode status:  http://master:50070/dfshealth.html
If you see '2' under live nodes, it means two DataNodes are up and running and you have successfully set up a multi-node Hadoop cluster.

Conclusion

You can add more nodes to your Hadoop cluster: all you need to do is add the new slave node's host name or IP to the slaves file on the master, copy the SSH key to the new slave node, put the master IP in the masters file on the new slave node, and then restart the Hadoop services, as sketched below. Congratulations!! You have successfully set up a multi-node Hadoop cluster.
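As a rough sketch of those steps, assuming a hypothetical new node named slave2 with IP 192.168.1.39 and the same ubuntu user (both the name and the address are only examples):

# On master: make the new host resolvable and list it as a slave
# (add the same /etc/hosts entry on the other nodes too)
echo "192.168.1.39 slave2" | sudo tee -a /etc/hosts
echo "slave2" >> ~/hadoop-2.7.3/etc/hadoop/slaves

# On master: copy the SSH public key to the new node
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2

# On slave2: install Java and Hadoop as described above and put "master" in etc/hadoop/masters

# On master: restart the Hadoop services
cd ~/hadoop-2.7.3
sbin/stop-yarn.sh && sbin/stop-dfs.sh
sbin/start-dfs.sh && sbin/start-yarn.sh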
