Let's now see how to configure a Hadoop cluster of four servers: a master and three worker nodes.
- They’ll be referred to throughout this guide as hadoop1 (node-master), hadoop2, hadoop3 and hadoop4. It’s recommended that you set the hostname of each node to match this naming convention.
- Create a normal user for the install, and a user called hadoop for any Hadoop daemons.
The steps below use example IPs for each node. Adjust each example according to your configuration:
- hadoop1: 192.168.15.1 (node-master)
- hadoop2: 192.168.15.2 (node-worker1)
- hadoop3: 192.168.15.3 (node-worker2)
- hadoop4: 192.168.15.4 (node-worker3)
For a detailed cluster configuration, read the official ClusterSetup documentation.
1 Architecture of a Hadoop Cluster
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as the Web App Proxy Server and the MapReduce Job History Server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves.
Before configuring the master and slave nodes, it’s important to understand the different components of a Hadoop cluster.
A master node keeps knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation. hadoop1 (node-master) will handle this role in this guide and host two daemons:
- The NameNode manages the distributed file system and knows where the stored data blocks are located inside the cluster.
- The ResourceManager manages the YARN jobs and takes care of scheduling and executing processes on slave nodes.
Slave nodes store the actual data and provide processing power to run the jobs. They’ll be hadoop2, hadoop3 and hadoop4, and will host two daemons:
- The DataNode manages the actual data physically stored on the node.
- The NodeManager manages the execution of tasks on the node.
2 Configure the System
2.1 Create Host File on Each Node
For the nodes to communicate with each other by name, edit the /etc/hosts file on every node and add the IP addresses of the four servers.
Don’t forget to replace the sample IPs with your own:
192.168.15.1 hadoop1
192.168.15.2 hadoop2
192.168.15.3 hadoop3
192.168.15.4 hadoop4
2.2 Distribute Authentication Key-pairs for the Hadoop User
The master node will use an SSH connection with key-pair authentication to connect to the other nodes and manage the cluster.
Log in to hadoop1 (node-master) as the hadoop user and generate an SSH key:
$ ssh-keygen -b 4096
Copy the key to the other nodes. It’s good practice to also copy the key to the node-master itself, so that you can also use it as a DataNode if needed.
Type the following commands, and enter the hadoop user’s password when asked.
If you are prompted whether or not to add the key to known hosts, enter yes:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop1
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop2
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop3
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop4
2.3 Download and Unpack Hadoop Binaries
Log in to node-master as the hadoop user, download the Hadoop tarball from the Hadoop project page, and extract it:
$ cd
$ wget http://apache.mindstudios.com/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
$ tar -xzf hadoop-2.8.2.tar.gz
$ mv hadoop-2.8.2 hadoop
2.4 Set Environment Variables
Add the Hadoop binaries to your PATH. Edit /home/hadoop/.bash_profile and add the following line:
/home/hadoop/.bash_profile
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
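To apply the change in your current session and check that the Hadoop commands are now found on the PATH (a quick, optional verification step):
$ source /home/hadoop/.bash_profile
$ which hdfs start-dfs.sh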
2.5 Disable firewall
Hadoop uses many ports for communication between nodes. To disable the firewall on each node, run:
yum erase firewalld
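Alternatively, if you prefer not to remove the package, a less invasive option (assuming firewalld is the active firewall, as on a stock CentOS/RHEL 7 install) is to stop and disable the service:
systemctl stop firewalld
systemctl disable firewalld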
3 Configure the Master Node
Configuration will be done on hadoop1 (node-master) and then replicated to the other worker nodes.
3.1 Set JAVA_HOME
- Check that Java is installed on your system:
rpm -q java-1.8.0-openjdk-headless
If it is not installed, install it with:
yum install java-1.8.0-openjdk-headless java-1.8.0-openjdk-devel
- Get your Java installation path (see the sketch after this list).
- Edit ~/hadoop/etc/hadoop/hadoop-env.sh and replace the JAVA_HOME line with your path:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre
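One quick way to find that path (a generic shell sketch; the exact version string on your system may differ from the one shown above) is to resolve the java binary and keep everything up to the jre directory:
$ readlink -f $(which java)
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre/bin/java
Strip the trailing /bin/java and use the remainder as JAVA_HOME.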
3.2 Set NameNode Location
On each node, update ~/hadoop/etc/hadoop/core-site.xml to set the NameNode location to hadoop1 (node-master) on port 9000:
~/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop1:9000</value>
    </property>
</configuration>
3.3 Set path for HDFS
Edit ~/hadoop/etc/hadoop/hdfs-site.xml
~/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
The last property, dfs.replication, indicates how many times each block of data is replicated in the cluster. You can set it to 2, for example, to have every block stored on two different nodes. Don’t enter a value higher than the actual number of slave nodes.
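Note that dfs.replication only sets the default for files written after the change. As an illustration (the path here is just a placeholder), you can change the replication factor of data that already exists in HDFS with:
$ hdfs dfs -setrep -w 2 /path/in/hdfs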
3.4 Configure Slaves
The slaves file is used by the startup scripts to start the required daemons on all nodes. Edit ~/hadoop/etc/hadoop/slaves to be:
~/hadoop/etc/hadoop/slaves
hadoop2
hadoop3
hadoop4
4 Configure Memory Allocation
Memory allocation can be tricky on low RAM nodes because default values are not suitable for nodes with less than 8GB of RAM.
This section will highlight how memory allocation works for MapReduce jobs, and provide a sample configuration for 2GB RAM nodes.
4.1 The Memory Allocation Properties
A YARN job is executed with two kinds of resources:
- An Application Master (AM) is responsible for monitoring the application and coordinating distributed executors in the cluster.
- Some executors, created by the AM, actually run the job. For a MapReduce job, they perform the map and reduce operations in parallel.
Both are run in containers on the slave nodes. Each slave node runs a NodeManager daemon that’s responsible for container creation on that node. The whole cluster is managed by a ResourceManager that schedules container allocation on all the slave nodes, depending on capacity requirements and the current load.
Four types of resource allocations need to be configured properly for the cluster to work. These are:
- How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node. This value is configured in yarn-site.xml:
  - yarn.nodemanager.resource.memory-mb
- How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else its allocation will fail, and memory is always allocated as a multiple of the minimum amount. Those values are configured in yarn-site.xml:
  - yarn.scheduler.maximum-allocation-mb
  - yarn.scheduler.minimum-allocation-mb
- How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit within the container maximum size. This is configured in mapred-site.xml with the property:
  - yarn.app.mapreduce.am.resource.mb
- How much memory will be allocated to each map or reduce operation. This should be less than the maximum container size. This is configured in mapred-site.xml with the properties:
  - mapreduce.map.memory.mb
  - mapreduce.reduce.memory.mb
The relationship between all those properties: yarn.app.mapreduce.am.resource.mb, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb must each fit within yarn.scheduler.maximum-allocation-mb, which in turn must not exceed yarn.nodemanager.resource.memory-mb.
4.1.1 Sample Configuration for 2GB Nodes
For 2GB nodes, a working configuration may be:
| Property | Value (MB) |
|---|---|
| yarn.nodemanager.resource.memory-mb | 1536 |
| yarn.scheduler.maximum-allocation-mb | 1536 |
| yarn.scheduler.minimum-allocation-mb | 128 |
| yarn.app.mapreduce.am.resource.mb | 512 |
| mapreduce.map.memory.mb | 256 |
| mapreduce.reduce.memory.mb | 256 |
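As a quick sanity check on these numbers: one ApplicationMaster at 512 MB plus two map or reduce containers at 256 MB each totals 1024 MB, which fits within the 1536 MB reserved for YARN containers on each node, and every value is a multiple of the 128 MB minimum allocation.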
- Edit ~/hadoop/etc/hadoop/yarn-site.xml and add the following lines:
~/hadoop/etc/hadoop/yarn-site.xml
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
</property>
- The property yarn.nodemanager.vmem-check-enabled disables virtual-memory checking, which can otherwise prevent containers from being allocated properly on JDK8.
- The property yarn.resourcemanager.hostname must point (on each node of the cluster) to the IP address or hostname of the node-master. If it is not specified, you will not see the cluster nodes in the console.
- Copy mapred-site.xml.template to mapred-site.xml:
cp ~/hadoop/etc/hadoop/mapred-site.xml.template ~/hadoop/etc/hadoop/mapred-site.xml
- Edit ~/hadoop/etc/hadoop/mapred-site.xml and add the following lines:
~/hadoop/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>hadoop1:9001</value>
    </property>
</configuration>
4.1.2 View console nodes
Connect to the Hadoop console at http://hadoop1:8088

You should see the three worker nodes.
5 Duplicate Config Files on Each Node
- Copy the Hadoop binaries to the slave nodes:
$ cd /home/hadoop/
$ scp hadoop-*.tar.gz hadoop2:/home/hadoop
$ scp hadoop-*.tar.gz hadoop3:/home/hadoop
$ scp hadoop-*.tar.gz hadoop4:/home/hadoop
- Connect to hadoop2 via ssh. A password isn’t required, thanks to the ssh keys copied above:
$ ssh hadoop2
- Unzip the binaries, rename the directory, and exit hadoop2 to get back to the node-master:
$ tar -xzf hadoop-2.8.2.tar.gz
$ mv hadoop-2.8.2 hadoop
$ exit
- Install Hadoop on hadoop3 and hadoop4 by repeating the previous steps 2 and 3.
- Copy the Hadoop configuration files to the slave nodes:
$ for node in hadoop2 hadoop3 hadoop4; do scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; done
6 Format HDFS
HDFS needs to be formatted like any classical file system. On hadoop1 (node-master), run the following command:
$ hdfs namenode -format
Your Hadoop installation is now configured and ready to run.
7 Run and monitor HDFS
This section will walk through starting HDFS on the NameNode and DataNodes, checking that everything is working properly, and interacting with HDFS data.
7.1 Start and Stop HDFS
-
Start the
HDFS
by running the following script from node-master:Copy$ start-dfs.sh
NameNode
andSecondaryNameNode
on hadoop1 ( node-master), andDataNode
on hadoop2, hadoop3 and hadoop4, according to the configuration in the slaves config file. -
Check that every process is running with the jps command on each node. You should get on node-master (PID will be different):
Copy
$ jps
21922 Jps 21603 NameNode 21787 SecondaryNameNode
Copy$ jps
19728 DataNode 19819 Jps
- To stop HDFS on the master and slave nodes, run the following command from node-master:
$ stop-dfs.sh
- You can also use the friendlier web user interface. Point your browser to http://hadoop1:50070 and you’ll get a user-friendly monitoring console.
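Since the hadoop user's SSH key is already distributed, a convenient way to run the jps check on all workers from the node-master is a simple shell loop (assuming jps is on the default PATH on the workers, which it is when the OpenJDK devel package is installed):
$ for node in hadoop2 hadoop3 hadoop4; do echo "--- $node"; ssh $node jps; done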
7.2 Monitor your HDFS Cluster
- You can get useful information about your running HDFS cluster with the hdfs dfsadmin command. Try, for example:
$ hdfs dfsadmin -report
Configured Capacity: 315525955584 (293.86 GB)
Present Capacity: 308341153792 (287.17 GB)
DFS Remaining: 308341141504 (287.17 GB)
DFS Used: 12288 (12 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.15.2:50010 (hadoop2)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 105175318528 (97.95 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 2395021312 (2.23 GB)
DFS Remaining: 102780293120 (95.72 GB)
DFS Used%: 0.00%
DFS Remaining%: 97.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 12 14:09:18 CET 2018

Name: 192.168.15.3:50010 (hadoop3)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 105175318528 (97.95 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 2394820608 (2.23 GB)
DFS Remaining: 102780493824 (95.72 GB)
DFS Used%: 0.00%
DFS Remaining%: 97.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 12 14:09:18 CET 2018

Name: 192.168.15.4:50010 (hadoop4)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 105175318528 (97.95 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 2394959872 (2.23 GB)
DFS Remaining: 102780354560 (95.72 GB)
DFS Used%: 0.00%
DFS Remaining%: 97.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Feb 12 14:09:18 CET 2018
7.3 Create your home directory
First, manually create your home directory. All other commands will use a path relative to this default home directory:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hadoop
$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-02-12 14:11 /user
7.4 Put and Get Data to HDFS
Now we can start writing and reading to HDFS, which is done with the command hdfs dfs. Our home directory in HDFS is /user/hadoop.
Let’s use some books from the Gutenberg Project as an example.
- Create a books directory in HDFS. The following command will create it in the home directory, /user/hadoop/books:
$ hdfs dfs -mkdir books
$ hdfs dfs -ls -R /
drwxr-xr-x - hadoop supergroup 0 2018-02-12 14:11 /user
drwxr-xr-x - hadoop supergroup 0 2018-02-12 14:14 /user/hadoop
drwxr-xr-x - hadoop supergroup 0 2018-02-12 14:14 /user/hadoop/books
- Grab a few books from the Gutenberg project:
$ wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
$ wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
$ wget -O frankenstein.txt http://www.gutenberg.org/files/84/84-0.txt
- Put the three books into HDFS, in the books directory:
$ hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
- List the contents of the books directory:
$ hdfs dfs -ls books
Found 3 items
-rw-r--r-- 1 deister supergroup 173595 2018-02-10 14:36 books/alice.txt
-rw-r--r-- 1 deister supergroup 168588 2018-02-10 14:36 books/frankenstein.txt
-rw-r--r-- 1 deister supergroup 594933 2018-02-10 14:36 books/holmes.txt
- You can also directly print the books from HDFS:
$ hdfs dfs -cat books/alice.txt
ALICE’S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
...
There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation, or print help with:
$ hdfs dfs -help
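For example, to complete the round trip and copy a book back out of HDFS to the local filesystem, you can use the get subcommand (the local filename here is just an illustration):
$ hdfs dfs -get books/alice.txt alice-from-hdfs.txt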
7.5 Auto Start-Stop Hadoop using systemd
To start and stop Hadoop when the system boots or shuts down, you can create a systemd service unit at /etc/systemd/system/hadoop.service.
[Unit]
Description=Hadoop DFS namenode and datanode
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target

[Service]
User=hadoop
Group=hadoop
Type=forking
ExecStart=/home/hadoop/hadoop/sbin/start-dfs.sh
ExecStop=/home/hadoop/hadoop/sbin/stop-dfs.sh
WorkingDirectory=/home/hadoop/
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0
Environment=HADOOP_HOME=/home/hadoop/hadoop
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/tmp/hadoop-hadoop-namenode.pid

[Install]
WantedBy=multi-user.target
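After creating the unit file, reload systemd so that it picks up the new unit (standard systemd housekeeping):
systemctl daemon-reload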
Now that we have our systemd unit, we can start the Hadoop daemons from systemd:
systemctl start hadoop.service # Start service
To run Hadoop daemons at startup:
systemctl enable hadoop.service
Finally, to stop your daemons:
systemctl stop hadoop.service
7.6 Exit Hadoop safe mode
When the Hadoop engine is stopped improperly or under other unusual circumstances, Hadoop can enter safe mode. Trying to use Hadoop then throws exceptions similar to this one:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive/hadoop/15f0a0a4-74b9-40db-b8f9-0ddc1d7de9dd. Name node is in safe mode.
To exit safe mode and resume normal Hadoop operation, you can use this administration command:
hdfs dfsadmin -safemode leave
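You can also check the current safe mode state before (or after) leaving it:
hdfs dfsadmin -safemode get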
8 Run YARN
HDFS is a distributed storage system; it doesn’t provide any services for running and scheduling tasks in the cluster. That is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN.
8.1 Start and Stop YARN
- Start YARN with the script:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-hadoop1.localdomain.out
hadoop2: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop2.localdomain.out
hadoop4: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop4.localdomain.out
hadoop3: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop3.localdomain.out
- Check that everything is running with the jps command. In addition to the previous HDFS daemons, you should see a ResourceManager on hadoop1 (node-master), and a NodeManager on hadoop2, hadoop3 and hadoop4:
$ jps
16577 Jps
16307 ResourceManager
15382 NameNode
15590 SecondaryNameNode
- To stop YARN, run the following command on hadoop1 (node-master):
$ stop-yarn.sh
8.2 Add the Job History server
Start the MapReduce JobHistory Server with the following command, run as mapred on the designated server, which in our case is the first node for simplicity:
$ mr-jobhistory-daemon.sh start historyserver
8.3 Monitor YARN
The yarn command provides utilities to manage your YARN cluster. You can also print a report of running nodes with the command:
$ yarn node -list
18/02/13 10:01:23 INFO client.RMProxy: Connecting to ResourceManager at hadoop1/192.168.15.1:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
localhost:44264 RUNNING localhost:8042 0
localhost:38603 RUNNING localhost:8042 0
localhost:35701 RUNNING localhost:8042 0
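The same CLI can also list applications; once you have submitted jobs (see the sections below), you can check them with:
$ yarn application -list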
As with HDFS, YARN provides a friendlier web UI, started by default on port 8088 of the ResourceManager. Point your browser to http://hadoop1:8088 and browse the UI.

8.4 Submit MapReduce Jobs to YARN
YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three books previously uploaded to HDFS.
- Submit a job with the sample jar to YARN. On hadoop1 (node-master), run:
$ hdfs dfs -rm -r output
$ yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount "books/*" output
The first command removes any output directory left over from a previous run; the job fails if the output directory already exists. The last argument is where the output of the job will be saved in HDFS; in this example, it is a directory named "output".
- After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of success, the output will resemble:
$ hdfs dfs -ls output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2017-10-11 14:09 output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 269158 2017-10-11 14:09 output/part-r-00000
- Print the result with:
$ hdfs dfs -cat output/part-r-00000
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Are 1
"'Arthur!' 1
"'As 1
"'At 1
"'Because 1
"'Boy, 1
"'Breckinridge, 1
"'But 1
"'But, 1
"'But,' 1
"'Certainly 2
...
“come 1
“it” 2
“much 1
“poison” 1
“purpose”?’ 1
“‘TIS 1