Let's now see how to configure a Hadoop cluster of four servers: a master and three worker nodes.

  1. They’ll be referred to throughout this guide as hadoop1 (node-master), hadoop2, hadoop3 and hadoop4. It’s recommended that you set the hostname of each node to match this naming convention (a hostnamectl example follows this list).
  2. Create a normal user for the install, and a user called hadoop for any Hadoop daemons.
  3. The steps below use example IPs for each node. Adjust each example according to your configuration:
    • hadoop1: 192.168.15.1 (node-master)
    • hadoop2: 192.168.15.2 (node-worker1)
    • hadoop3: 192.168.15.3 (node-worker2)
    • hadoop4: 192.168.15.4 (node-worker3)
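
On a systemd-based distribution (such as the CentOS/RHEL setup implied by the yum commands later in this guide), you can set each hostname as root with hostnamectl. For example, on the first node:

hostnamectl set-hostname hadoop1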

For a detailed cluster configuration, read the Apache Hadoop ClusterSetup documentation.

1 Architecture of a Hadoop Cluster

Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as the Web App Proxy Server and the MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.

The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves.

Before configuring the master and slave nodes, it’s important to understand the different components of a Hadoop cluster.

A master node keeps knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation. hadoop1 (node-master) will handle this role in this guide, and host two daemons:

  • The NameNode: manages the distributed file system and knows where data blocks are stored inside the cluster.
  • The ResourceManager: manages the YARN jobs and takes care of scheduling and executing processes on slave nodes.

Slave nodes store the actual data and provide processing power to run the jobs. They’ll be hadoop2, hadoop3 and hadoop4, and will host two daemons:

  • The DataNode manages the actual data physically stored on the node and reports to the NameNode.
  • The NodeManager manages execution of tasks on the node.

2 Configure the System

2.1 Create Host File on Each Node

For the nodes to communicate with each other by name, edit the /etc/hosts file on each node and add the IP addresses of the four servers. Don’t forget to replace the sample IPs with your own:

192.168.15.1	hadoop1
192.168.15.2	hadoop2
192.168.15.3	hadoop3
192.168.15.4	hadoop4
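
A quick way to confirm that name resolution works is to ping each node by name from every other node (an optional check, not part of the original steps):

$ ping -c 1 hadoop2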

2.2 Distribute Authentication Key-pairs for the hadoop User

To manage the cluster, the master node will use an SSH connection with key-pair authentication to connect to the other nodes.

Log in to hadoop1 (node-master) as the hadoop user, and generate an SSH key:

$ ssh-keygen -b 4096

Copy the key to the other nodes. It’s good practice to also copy the key to the node-master itself, so that you can also use it as a DataNode if needed. Type the following commands, and enter the hadoop user’s password when asked. If you are prompted whether or not to add the key to known hosts, enter yes:

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop1
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop2
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop3
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop4
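
To confirm that key-based authentication works, you can open a quick passwordless session from hadoop1 to each node (an optional check; it should print the remote hostname without asking for a password):

$ ssh hadoop2 hostname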

2.3 Download and Unpack Hadoop Binaries

Log in to hadoop1 (node-master) as the hadoop user, download the Hadoop tarball from the Hadoop project page, and extract it:

$ cd
$ wget http://apache.mindstudios.com/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
$ tar -xzf hadoop-2.8.2.tar.gz
$ mv hadoop-2.8.2 hadoop

2.4 Set Environment Variables

Add Hadoop binaries to your PATH. Edit /home/hadoop/.bash_profile and add the following line:


/home/hadoop/.bash_profile

PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
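
To apply the change to your current session and confirm that the Hadoop commands are found, you can reload the profile and check the path (an optional verification step):

$ source /home/hadoop/.bash_profile
$ which hdfs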

2.5 Disable the Firewall

Hadoop uses many ports for communication between nodes. To disable the firewall on each node, run:

yum erase firewalld
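
If you prefer not to remove the package entirely, stopping and disabling the firewalld service as root has the same effect for this guide (an alternative to the command above, not an additional step):

systemctl stop firewalld
systemctl disable firewalld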

3 Configure the Master Node

Configuration will be done on hadoop1 (node-master) and replicated to all the other worker nodes.

3.1 Set JAVA_HOME

  1. Check that Java is installed on your system:
    rpm -q java-1.8.0-openjdk-headless
    If it is not installed, install it with your distribution's package manager:
    yum install java-1.8.0-openjdk-headless java-1.8.0-openjdk-devel
  2. Get your Java installation path (see the example after this list).
  3. Edit ~/hadoop/etc/hadoop/hadoop-env.sh and set the JAVA_HOME line to your actual Java installation path, for example:
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre
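
A common way to find the installation path is to resolve the java binary and strip the trailing /bin/java (a generic technique, shown here as an illustration; on the example system above it prints the path ending in .../jre/bin/java):

$ readlink -f $(which java)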

3.2 Set NameNode Location

On each node, update ~/hadoop/etc/hadoop/core-site.xml to set the NameNode location to hadoop1 (node-master) on port 9000:


~/hadoop/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop1:9000</value>
    </property>
</configuration>

3.3 Set Path for HDFS

Edit ~/hadoop/etc/hadoop/hdfs-site.xml:

~/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/hadoop/data/nameNode</value>
    </property>

    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/hadoop/data/dataNode</value>
    </property>

    <property>
            <name>dfs.replication</name>
            <value>1</value>
    </property>
</configuration>

The last property, dfs.replication, indicates how many times data is replicated in the cluster. You can set it to 2 or 3 to have each block duplicated on multiple worker nodes. Don’t enter a value higher than the actual number of slave nodes.
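
Note that dfs.replication only sets the default for newly written files; files that already exist keep their previous factor. If you change the value later, you can adjust existing paths with hdfs dfs -setrep (an optional illustration; the path here is just an example):

$ hdfs dfs -setrep -w 2 /user/hadoop/books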

3.4 Configure Slaves

The file slaves is used by startup scripts to start required daemons on all nodes. Edit ~/hadoop/etc/hadoop/slaves to be:


~/hadoop/etc/hadoop/slaves

hadoop2
hadoop3
hadoop4

4 Configure Memory Allocation

Memory allocation can be tricky on low RAM nodes because default values are not suitable for nodes with less than 8GB of RAM. This section will highlight how memory allocation works for MapReduce jobs, and provide a sample configuration for 2GB RAM nodes.

4.1 The Memory Allocation Properties

A YARN job is executed with two kinds of resources:

  • An Application Master (AM) is responsible for monitoring the application and coordinating distributed executors in the cluster.
  • Some executors that are created by the AM actually run the job. For a MapReduce job, they’ll perform the map or reduce operations, in parallel.

Both are run in containers on slave nodes. Each slave node runs a NodeManager daemon that’s responsible for container creation on the node. The whole cluster is managed by a ResourceManager that schedules container allocation on all the slave nodes, depending on capacity requirements and current load.

Four types of resource allocations need to be configured properly for the cluster to work. These are:

  1. How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node.

    This value is configured in yarn-site.xml:

    • yarn.nodemanager.resource.memory-mb

  2. How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum (or else allocation will fail), and is always allocated as a multiple of the minimum amount of RAM.

    Those values are configured in yarn-site.xml:

    • yarn.scheduler.maximum-allocation-mb
    • yarn.scheduler.minimum-allocation-mb

  3. How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit in the container maximum size.

    This is configured in mapred-site.xml with the property:

    • yarn.app.mapreduce.am.resource.mb

  4. How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.

    This is configured in mapred-site.xml with properties:

    • mapreduce.map.memory.mb
    • mapreduce.reduce.memory.mb

To summarize the relationship between these properties: the per-node limit (yarn.nodemanager.resource.memory-mb) is the largest value; every container must fall between the scheduler minimum and maximum allocations; and both the ApplicationMaster allocation and the map/reduce allocations must each fit inside a single container.

4.1.1 Sample Configuration for 2GB Nodes

For 2GB nodes, a working configuration may be:

Property                                Value
yarn.nodemanager.resource.memory-mb     1536
yarn.scheduler.maximum-allocation-mb    1536
yarn.scheduler.minimum-allocation-mb    128
yarn.app.mapreduce.am.resource.mb       512
mapreduce.map.memory.mb                 256
mapreduce.reduce.memory.mb              256
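
As a quick sanity check on these values: the ApplicationMaster (512 MB) plus one map task (256 MB) and one reduce task (256 MB) add up to 1024 MB, which fits within the 1536 MB per-node limit, and every value is a multiple of the 128 MB minimum allocation.
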
  1. Edit ~/hadoop/etc/hadoop/yarn-site.xml and add the following lines inside the <configuration> element:

    ~/hadoop/etc/hadoop/yarn-site.xml

    <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>1536</value>
    </property>
    
    <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>1536</value>
    </property>
    
    <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>128</value>
    </property>
    
    <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
    </property>
    <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop1</value>
    </property>
    • Setting yarn.nodemanager.vmem-check-enabled to false disables virtual-memory checking, which on JDK8 can otherwise prevent containers from being allocated properly.
    • The property yarn.resourcemanager.hostname must point (on each node of the cluster) to the IP address or hostname of the node-master. If it is not specified, the worker nodes will not appear in the web console.
  2. Copy mapred-site.xml.template to mapred-site.xml
    cp ~/hadoop/etc/hadoop/mapred-site.xml.template ~/hadoop/etc/hadoop/mapred-site.xml
  3. Edit ~/hadoop/etc/hadoop/mapred-site.xml and add the following lines:

    ~/hadoop/etc/hadoop/mapred-site.xml

    <configuration>
        <property>
            <name>yarn.app.mapreduce.am.resource.mb</name>
            <value>512</value>
        </property>
        <property>
            <name>mapreduce.map.memory.mb</name>
            <value>256</value>
        </property>
        <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>256</value>
        </property>
       <property> 
          <name>mapreduce.jobtracker.address</name> 
          <value>hadoop1:9001</value> 
       </property> 
    </configuration>

4.1.2 View console nodes

Once YARN is running (see Section 8), connect to the Hadoop console at http://hadoop1:8088.

You should see the three worker nodes.

5 Duplicate Config Files on Each Node

  1. Copy the Hadoop binaries to the slave nodes:
    $ cd /home/hadoop/
    $ scp hadoop-*.tar.gz hadoop2:/home/hadoop
    $ scp hadoop-*.tar.gz hadoop3:/home/hadoop
    $ scp hadoop-*.tar.gz hadoop4:/home/hadoop
  2. Connect to hadoop2 via ssh. A password isn’t required, thanks to the ssh keys copied above:
    $ ssh hadoop2
  3. Unzip the binaries, rename the directory, and exit hadoop2 to get back on the node-master:
    $ tar -xzf hadoop-2.8.2.tar.gz
    $ mv hadoop-2.8.2 hadoop
    $ exit
  4. Repeat steps 2 and 3 to install Hadoop on hadoop3 and hadoop4.
  5. Copy the Hadoop configuration files to the slave nodes:
    for node in hadoop2 hadoop3 hadoop4; do
        scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/;
    done

6 Format HDFS

HDFS needs to be formatted like any classical file system. On hadoop1 (node-master), run the following command:

$ hdfs namenode -format

Your Hadoop installation is now configured and ready to run.

7 Run and monitor HDFS

This section will walk through starting HDFS on the NameNode and DataNodes, monitoring that everything is working properly, and interacting with HDFS data.

7.1 Start and Stop HDFS

  1. Start HDFS by running the following script from node-master:
    $ start-dfs.sh
    It’ll start NameNode and SecondaryNameNode on hadoop1 (node-master), and DataNode on hadoop2, hadoop3 and hadoop4, according to the configuration in the slaves config file.
  2. Check that every process is running with the jps command on each node. On node-master you should get the following (PIDs will differ):
    $ jps
    21922 Jps
    21603 NameNode
    21787 SecondaryNameNode
    and on hadoop2, hadoop3 and hadoop4:
    $ jps
    19728 DataNode
    19819 Jps
  3. To stop HDFS on master and slave nodes, run the following command from node-master:
    $ stop-dfs.sh
  4. You can also use the friendlier web user interface. Point your browser to http://hadoop1:50070 and you’ll get a user-friendly monitoring console.

7.2 Monitor your HDFS Cluster

  1. You can get useful information about your running HDFS cluster with the hdfs dfsadmin command. Try, for example:
    $ hdfs dfsadmin -report
    Configured Capacity: 315525955584 (293.86 GB)
    Present Capacity: 308341153792 (287.17 GB)
    DFS Remaining: 308341141504 (287.17 GB)
    DFS Used: 12288 (12 KB)
    DFS Used%: 0.00%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    Missing blocks (with replication factor 1): 0
    Pending deletion blocks: 0
    
    -------------------------------------------------
    Live datanodes (3):
    
    Name: 192.168.15.2:50010 (hadoop2)
    Hostname: localhost
    Decommission Status : Normal
    Configured Capacity: 105175318528 (97.95 GB)
    DFS Used: 4096 (4 KB)
    Non DFS Used: 2395021312 (2.23 GB)
    DFS Remaining: 102780293120 (95.72 GB)
    DFS Used%: 0.00%
    DFS Remaining%: 97.72%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 1
    Last contact: Mon Feb 12 14:09:18 CET 2018
    
    
    Name: 192.168.15.3:50010 (hadoop3)
    Hostname: localhost
    Decommission Status : Normal
    Configured Capacity: 105175318528 (97.95 GB)
    DFS Used: 4096 (4 KB)
    Non DFS Used: 2394820608 (2.23 GB)
    DFS Remaining: 102780493824 (95.72 GB)
    DFS Used%: 0.00%
    DFS Remaining%: 97.72%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 1
    Last contact: Mon Feb 12 14:09:18 CET 2018
    
    
    Name: 192.168.15.4:50010 (hadoop4)
    Hostname: localhost
    Decommission Status : Normal
    Configured Capacity: 105175318528 (97.95 GB)
    DFS Used: 4096 (4 KB)
    Non DFS Used: 2394959872 (2.23 GB)
    DFS Remaining: 102780354560 (95.72 GB)
    DFS Used%: 0.00%
    DFS Remaining%: 97.72%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 1
    Last contact: Mon Feb 12 14:09:18 CET 2018

7.3 Create your home directory

First, manually create your home directory. All other commands will use a path relative to this default home directory:

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hadoop
$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-02-12 14:11 /user
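
As a shortcut, hdfs dfs -mkdir also accepts -p to create the parent directory in one step (equivalent to the two mkdir commands above):

$ hdfs dfs -mkdir -p /user/hadoop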

7.4 Put and Get Data to HDFS

Writing to and reading from HDFS is done with the hdfs dfs command.

Notice that we are using hadoop as the username, so our home directory in HDFS will be /user/hadoop.

Let’s use some books from the Gutenberg Project as an example.

  1. Create a books directory in HDFS. The following command will create it in the home directory, /user/hadoop/books:
    $ hdfs dfs -mkdir books
    $ hdfs dfs -ls -R /
    drwxr-xr-x   - hadoop supergroup          0 2018-02-12 14:11 /user
    drwxr-xr-x   - hadoop supergroup          0 2018-02-12 14:14 /user/hadoop
    drwxr-xr-x   - hadoop supergroup          0 2018-02-12 14:14 /user/hadoop/books
  2. Grab a few books from the Gutenberg project:
    $ wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
    $ wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
    $ wget -O frankenstein.txt http://www.gutenberg.org/files/84/84-0.txt
  3. Put the three books into HDFS, in the books directory:
    $ hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
  4. List the contents of the books directory:
    $ hdfs dfs -ls books
    Found 3 items
    -rw-r--r--   1 deister supergroup     173595 2018-02-10 14:36 books/alice.txt
    -rw-r--r--   1 deister supergroup     168588 2018-02-10 14:36 books/frankenstein.txt
    -rw-r--r--   1 deister supergroup     594933 2018-02-10 14:36 books/holmes.txt
  5. You can also print the books directly from HDFS:
    $ hdfs dfs -cat books/alice.txt
    ALICE’S ADVENTURES IN WONDERLAND
    
    Lewis Carroll
    
    THE MILLENNIUM FULCRUM EDITION 3.0
    
    CHAPTER I. Down the Rabbit-Hole
    
    Alice was beginning to get very tired of sitting by her sister on the
    ...

There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation, or print help with:

$ hdfs dfs -help

7.5 Auto Start-Stop Hadoop using systemd

To start and stop Hadoop when the system starts up or shuts down, you can create a systemd service unit at /etc/systemd/system/hadoop.service.

[Unit]
Description=Hadoop DFS namenode and datanode
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target

[Service]
User=hadoop
Group=hadoop
Type=forking
ExecStart=/home/hadoop/hadoop/sbin/start-dfs.sh
ExecStop=/home/hadoop/hadoop/sbin/stop-dfs.sh
WorkingDirectory=/home/hadoop/
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0
Environment=HADOOP_HOME=/home/hadoop/hadoop
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/tmp/hadoop-hadoop-namenode.pid

[Install]
WantedBy=multi-user.target
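
Since this is a newly created unit file, reload systemd so it picks it up (standard systemd usage, not specific to Hadoop):

systemctl daemon-reload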

Now that we have our systemd unit, we can start the Hadoop daemons from systemd:

systemctl start hadoop.service # Start service

To run Hadoop daemons at startup:

systemctl enable hadoop.service

Finally, to stop your daemons:

systemctl stop hadoop.service
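
You can also check whether the daemons started correctly with the standard systemd status command (an optional check):

systemctl status hadoop.service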

7.6 Exit Hadoop safe mode

When Hadoop is stopped improperly, or under other unusual circumstances, it can enter safe mode. Trying to use Hadoop then throws exceptions similar to this one:

org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive/hadoop/15f0a0a4-74b9-40db-b8f9-0ddc1d7de9dd. Name node is in safe mode.

To exit safe mode and resume normal Hadoop operation, you can use this administration command:

hdfs dfsadmin -safemode leave
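
You can also check the current safe mode state before forcing an exit (an optional check):

hdfs dfsadmin -safemode get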

8 Run YARN

HDFS is a distributed storage system; it doesn’t provide any services for running and scheduling tasks in the cluster. That is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN.

8.1 Start and Stop YARN

  • Start YARN with the script:
    $ start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-hadoop1.localdomain.out
    hadoop2: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop2.localdomain.out
    hadoop4: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop4.localdomain.out
    hadoop3: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-hadoop3.localdomain.out
  • Check that everything is running with the jps command. In addition to the previous HDFS daemons, you should see a ResourceManager on hadoop1 (node-master), and a NodeManager on hadoop2, hadoop3 and hadoop4.
    $ jps
    16577 Jps
    16307 ResourceManager
    15382 NameNode
    15590 SecondaryNameNode
  • To stop YARN, run the following command on hadoop1 (node-master):
    $ stop-yarn.sh

8.2 Add the Job History server

Start the MapReduce JobHistory Server with the following command, run as the hadoop user on the designated server, which in our case is the first node (hadoop1) for simplicity:

$ mr-jobhistory-daemon.sh start historyserver
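
To stop it later, the same script accepts a stop argument; by default, the JobHistory web UI listens on port 19888 (http://hadoop1:19888):

$ mr-jobhistory-daemon.sh stop historyserver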

8.3 Monitor YARN

The yarn command provides utilities to manage your YARN cluster. You can print a report of running nodes with the command:

$ yarn node -list
18/02/13 10:01:23 INFO client.RMProxy: Connecting to ResourceManager at hadoop1/192.168.15.1:8032
Total Nodes:3
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
 localhost:44264	        RUNNING	   localhost:8042	                           0
 localhost:38603	        RUNNING	   localhost:8042	                           0
 localhost:35701	        RUNNING	   localhost:8042	                           0
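
The same tool can also list applications that have been submitted to the cluster (an optional check, most useful once jobs are running):

$ yarn application -list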

As with HDFS, YARN provides a friendlier web UI, started by default on port 8088 of the Resource Manager. Point your browser to http://hadoop1:8088 and browse the UI.

8.4 Submit MapReduce Jobs to YARN

YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three books previously uploaded to HDFS.

  1. Submit a job with the sample jar to YARN. On hadoop1 (node-master), run:
    $ hdfs dfs -rm -r output
        
    $ yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount "books/*" output
    The last argument is where the output of the job will be saved in HDFS. In this example, it is a directory named output inside the user’s HDFS home directory (/user/hadoop/output).
  2. After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of a success, the output will resemble:
    $ hdfs dfs -ls output
    Found 2 items
    -rw-r--r--   1 hadoop supergroup          0 2017-10-11 14:09 output/_SUCCESS
    -rw-r--r--   1 hadoop supergroup     269158 2017-10-11 14:09 output/part-r-00000
  3. Print the result with:
    $ hdfs dfs -cat output/part-r-00000
    "'A	1
    "'About	1
    "'Absolute	1
    "'Ah!'	2
    "'Ah,	2
    "'Ample.'	1
    "'And	10
    "'Are	1
    "'Arthur!'	1
    "'As	1
    "'At	1
    "'Because	1
    "'Boy,	1
    "'Breckinridge,	1
    "'But	1
    "'But,	1
    "'But,'	1
    "'Certainly	2
    ...
    “come	1
    “it”	2
    “much	1
    “poison”	1
    “purpose”?’	1
    “‘TIS	1
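
To copy the complete word-count result back to the local filesystem, hdfs dfs -getmerge concatenates the output parts into a single local file (an optional extra step; the local file name here is just an example):

$ hdfs dfs -getmerge output wordcount-result.txt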