Let's now see how to configure a Hadoop cluster of four servers: a master and three worker nodes.
- They’ll be referred to throughout this guide as hadoop1 (node-master), hadoop2, hadoop3 and hadoop4. It’s recommended that you set the hostname of each node to match this naming convention.
- Create a normal user for the install, and a user called hadoop for any Hadoop daemons.
The steps below use example IPs for each node. Adjust each example according to your configuration:
- hadoop1: 192.168.15.1 (node-master)
- hadoop2: 192.168.15.2 (node-worker1)
- hadoop3: 192.168.15.3 (node-worker2)
- hadoop4: 192.168.15.4 (node-worker3)
For a detailed cluster configuration, read the Hadoop ClusterSetup documentation.
1 Architecture of a Hadoop Cluster
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves.
Before configuring the master and slave nodes, it’s important to understand the different components of a Hadoop cluster.
A master node keeps knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation. hadoop1 (node-master) will handle this role in this guide, and host two daemons:
- NameNode: manages the distributed file system and knows where the stored data blocks are inside the cluster.
- ResourceManager: manages the YARN jobs and takes care of scheduling and executing processes on slave nodes.
Slave nodes store the actual data and provide processing power to run the jobs. They’ll be hadoop2, hadoop3 and hadoop4, and will host two daemons:
- DataNode: manages the actual data physically stored on the node.
- NodeManager: manages execution of tasks on the node.
2 Configure the System
2.1 Create Host File on Each Node
For the nodes to reach each other by name, edit the /etc/hosts file on each node and add the IP addresses of the four servers.
Don’t forget to replace the sample IP with your IP:
192.168.15.1 hadoop1
192.168.15.2 hadoop2
192.168.15.3 hadoop3
192.168.15.4 hadoop4
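The entries above follow a regular pattern, so they can be generated rather than typed on every node. The following is a minimal sketch that assumes the example addressing used in this guide (192.168.15.N for hadoopN); the function name `hosts_entries` is hypothetical.

```shell
# Print one /etc/hosts entry per node.
# Assumes this guide's example scheme: hadoopN has address 192.168.15.N.
hosts_entries() {
  for i in 1 2 3 4; do
    printf '192.168.15.%d hadoop%d\n' "$i" "$i"
  done
}

# Show the entries; review them before adding them to /etc/hosts.
hosts_entries
```

After checking the output, the entries could be appended on each node as root, for example with `hosts_entries >> /etc/hosts`.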
2.2 Distribute Authentication Key-pairs for the Hadoop User
The master node will use an SSH connection with key-pair authentication to connect to the other nodes and manage the cluster.
Log in to hadoop1 (node-master) as the hadoop user, and generate an SSH key:
$ ssh-keygen -b 4096
Copy the key to the other nodes. It’s good practice to also copy the key to the node-master itself,
so that you can also use it as a
DataNode if needed.
Type the following commands, and enter the hadoop user’s password when asked.
If you are prompted whether or not to add the key to known hosts, enter yes:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop1
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop2
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop3
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop4
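Before going further it can help to confirm that key-based login really works on every node, since the Hadoop start scripts rely on it. The sketch below is one way to do that; the `check_ssh` function and the `RUN` switch are hypothetical names introduced for illustration. By default it only prints the commands it would run.

```shell
# Hostnames follow this guide's naming; adjust to your cluster.
NODES="hadoop1 hadoop2 hadoop3 hadoop4"

# Print the check command for each node; with RUN=1, execute it instead.
# BatchMode=yes makes ssh fail rather than prompt for a password, so a
# failure means key-based login is not working for that node.
check_ssh() {
  for node in $NODES; do
    if [ "${RUN:-0}" = "1" ]; then
      ssh -o BatchMode=yes -o ConnectTimeout=5 "hadoop@$node" true \
        && echo "$node: ok" || echo "$node: FAILED"
    else
      echo "ssh -o BatchMode=yes -o ConnectTimeout=5 hadoop@$node true"
    fi
  done
}

check_ssh
```

Running `RUN=1 check_ssh` on node-master performs the actual checks and reports one line per node.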
2.3 Download and Unpack Hadoop Binaries
$ cd
$ wget http://apache.mindstudios.com/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
$ tar -xzf hadoop-2.8.2.tar.gz
$ mv hadoop-2.8.2 hadoop
2.4 Set Environment Variables
Add the Hadoop binaries to your PATH: edit /home/hadoop/.bash_profile and add a line that prepends Hadoop's bin and sbin directories (under /home/hadoop/hadoop) to PATH.
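The exact line is not reproduced in this guide, so the following is only a sketch of what it might look like. It assumes Hadoop was unpacked to /home/hadoop/hadoop as in section 2.3; setting HADOOP_HOME here is an optional convenience, not something the guide requires at this step.

```shell
# Assumes Hadoop was unpacked to /home/hadoop/hadoop (section 2.3).
HADOOP_HOME=/home/hadoop/hadoop
# bin holds the hdfs and yarn commands; sbin holds the start/stop scripts.
PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
export HADOOP_HOME PATH
```

Log out and back in (or run `source ~/.bash_profile`) for the change to take effect in the current session.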
2.5 Disable firewall
Hadoop uses many ports for communication between nodes. To disable the firewall on each node, remove firewalld:
yum erase firewalld
3 Configure the Master Node
Configuration will be done on hadoop1 (node-master) and replicated to all worker nodes.
3.1 Set JAVA_HOME
Check that Java is installed on your system:
rpm -q java-1.8.0-openjdk-headless
If it is not installed, install it:
yum install java-1.8.0-openjdk-headless java-1.8.0-openjdk-devel
- Get your Java installation path.
- Edit ~/hadoop/etc/hadoop/hadoop-env.sh and replace the line that sets JAVA_HOME so that it points to that path.
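One common way to find the Java installation path is to resolve the real location of the `java` binary and strip the trailing /bin/java. The sketch below shows that step; the helper name `java_home_from_binary` and the sample path are hypothetical, and the path on your system may differ.

```shell
# Strip the trailing /bin/java from a resolved java binary path.
java_home_from_binary() {
  printf '%s\n' "${1%/bin/java}"
}

# On a node with Java installed, resolve the real binary behind the
# `java` command (readlink -f follows the alternatives symlinks):
#   java_home_from_binary "$(readlink -f "$(command -v java)")"

# Example with a hypothetical OpenJDK 8 path:
java_home_from_binary /usr/lib/jvm/java-1.8.0-openjdk/jre/bin/java
# prints /usr/lib/jvm/java-1.8.0-openjdk/jre
```

The printed directory is the value to use on the `export JAVA_HOME=` line in hadoop-env.sh.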
3.2 Set NameNode Location
On each node, update ~/hadoop/etc/hadoop/core-site.xml to set the NameNode location to hadoop1 (node-master) on port 9000:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop1:9000</value>
    </property>
</configuration>
3.3 Set path for HDFS
Edit ~/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
The last property, dfs.replication, indicates how many times data is replicated in the cluster. You can set it to 2, for example, to have every block stored on two nodes. Don’t enter a value higher than the actual number of slave nodes.
3.4 Configure Slaves
The slaves file is used by the startup scripts to start the required daemons on all nodes. Edit ~/hadoop/etc/hadoop/slaves to be:
hadoop2
hadoop3
hadoop4
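Since dfs.replication should not exceed the number of slave nodes, the slaves file can be used to sanity-check the value. The following sketch counts the hosts in a slaves file and compares a replication factor against that count; `count_workers` and `replication_ok` are hypothetical helper names, and the demo runs against a temporary copy of the file rather than the real one.

```shell
# Count worker hostnames in a slaves file, ignoring blank lines and comments.
count_workers() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" | wc -l | tr -d ' '
}

# Succeed only if the replication factor does not exceed the worker count.
replication_ok() {
  [ "$1" -le "$(count_workers "$2")" ]
}

# Demo on a temporary copy of this guide's slaves file.
tmp_slaves=$(mktemp)
printf 'hadoop2\nhadoop3\nhadoop4\n' > "$tmp_slaves"
count_workers "$tmp_slaves"
replication_ok 1 "$tmp_slaves" && echo "dfs.replication=1 fits"
replication_ok 4 "$tmp_slaves" || echo "dfs.replication=4 is too high"
rm -f "$tmp_slaves"
```

On node-master the same check could be run against the real file, for example `replication_ok 2 ~/hadoop/etc/hadoop/slaves`.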
4 Configure Memory Allocation
Memory allocation can be tricky on low-RAM nodes because the default values are not suitable for nodes with less than 8GB of RAM.
This section highlights how memory allocation works for MapReduce jobs, and provides a sample configuration for 2GB RAM nodes.
4.1 The Memory Allocation Properties
A YARN job is executed with two kinds of resources:
- An Application Master (AM) is responsible for monitoring the application and coordinating distributed executors in the cluster.
- Some executors, created by the AM, actually run the job. For a MapReduce job, they’ll perform map or reduce operations in parallel.
Both are run in containers on slave nodes. Each slave node runs a NodeManager daemon that’s responsible for container creation on the node. The whole cluster is managed by a ResourceManager that schedules container allocation on all the slave nodes, depending on capacity requirements and current load.
Four types of resource allocations need to be configured properly for the cluster to work. These are:
- How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node.
  This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.
- How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else allocation will fail, and it will always be allocated as a multiple of the minimum amount of RAM.
  Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.
- How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit in the container maximum size.
  This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.
- How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.
  This is configured in mapred-site.xml with mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
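The relationships between these four settings can be checked with a little arithmetic before touching any configuration file. The sketch below is an illustration only: `round_up_to_min` and `limits_ok` are hypothetical helpers, and the numbers passed to them are the values from the 2GB sample configuration in section 4.1.1. It models memory only and says nothing about CPU limits.

```shell
# Round a memory request (MB) up to the next multiple of the scheduler minimum.
round_up_to_min() {
  request=$1; min=$2
  echo $(( (request + min - 1) / min * min ))
}

# Succeed if the limits are mutually consistent:
# node >= max >= min, and the AM, map and reduce requests all fit within max.
limits_ok() {
  node=$1; max=$2; min=$3; am=$4; map=$5; reduce=$6
  [ "$node" -ge "$max" ] && [ "$max" -ge "$min" ] &&
  [ "$am" -le "$max" ] && [ "$map" -le "$max" ] && [ "$reduce" -le "$max" ]
}

round_up_to_min 300 128      # prints 384
limits_ok 1536 1536 128 512 256 256 && echo "limits are consistent"
```

A request of 300 MB with a 128 MB minimum is therefore granted as 384 MB, which is why task sizes are usually chosen as exact multiples of the minimum.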
4.1.1 Sample Configuration for 2GB Nodes
For 2GB nodes, a working configuration may be the following.
Edit ~/hadoop/etc/hadoop/yarn-site.xml and add the following lines:
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1536</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
</property>
- The property yarn.nodemanager.vmem-check-enabled, set to false, disables virtual-memory checking. When enabled, this check can prevent containers from being allocated properly on JDK8.
- The property yarn.resourcemanager.hostname must point (on each node of the cluster) to the IP address or hostname of the node-master. If it is not specified, you will not see the cluster nodes in the console.
Copy mapred-site.xml.template to mapred-site.xml:
cp ~/hadoop/etc/hadoop/mapred-site.xml.template ~/hadoop/etc/hadoop/mapred-site.xml
Edit ~/hadoop/etc/hadoop/mapred-site.xml and add the following lines:
<configuration>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>hadoop1:9001</value>
    </property>
</configuration>
MapReduce jobs are submitted to YARN only when mapreduce.framework.name is set to yarn in this file; its default value is local. If your jobs run locally instead of on the cluster, add that property.
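To see what these sample values mean in practice, the number of map or reduce containers that can run at once follows from simple division. The sketch below works through that arithmetic for a single job on three worker nodes. It is an estimate under stated assumptions: only memory is considered (CPU limits are ignored), and the variable and function names are hypothetical.

```shell
# Sample values from the configuration above (all in MB).
NODE_MB=1536; AM_MB=512; TASK_MB=256; WORKERS=3

# Containers of a given size that fit in a given amount of memory.
fit() { echo $(( $1 / $2 )); }

per_node=$(fit "$NODE_MB" "$TASK_MB")
# The node hosting the ApplicationMaster has less room left for tasks.
am_node=$(fit $(( NODE_MB - AM_MB )) "$TASK_MB")
total=$(( am_node + (WORKERS - 1) * per_node ))

echo "tasks per free node: $per_node"
echo "tasks on the AM node: $am_node"
echo "concurrent tasks for one job: $total"
```

With these values a node without the AM holds 6 task containers, the node running the AM holds 4, and one job can run 16 tasks concurrently across the cluster.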
4.1.2 View console nodes
Once YARN has been started (see section 8.1), connect to the Hadoop console at http://hadoop1:8088
You should see the three worker nodes.
5 Duplicate Config Files on Each Node
Copy the Hadoop binaries to the slave nodes:
$ cd /home/hadoop/
$ scp hadoop-*.tar.gz hadoop2:/home/hadoop
$ scp hadoop-*.tar.gz hadoop3:/home/hadoop
$ scp hadoop-*.tar.gz hadoop4:/home/hadoop
Connect to hadoop2 via ssh. A password isn’t required, thanks to the ssh keys copied above:
$ ssh hadoop2
Unzip the binaries, rename the directory, and exit hadoop2 to get back on the node-master:
$ tar -xzf hadoop-2.8.2.tar.gz
$ mv hadoop-2.8.2 hadoop
$ exit
- Install Hadoop on hadoop3 and hadoop4 by repeating the two previous steps (connect via ssh, then unpack).
Copy the Hadoop configuration files to the slave nodes:
for node in hadoop2 hadoop3 hadoop4; do scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/; done
6 Format HDFS
HDFS needs to be formatted like any classical file system. On hadoop1 (node-master), run the following command:
$ hdfs namenode -format
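Formatting an already formatted NameNode discards the existing file system metadata, so it can be worth checking first whether the directory has been formatted before. A formatted NameNode directory contains a current/VERSION file, which the sketch below looks for. The helper name `namenode_is_formatted` is hypothetical, and the directory is the dfs.namenode.name.dir value used in this guide.

```shell
# Directory from dfs.namenode.name.dir in this guide's hdfs-site.xml.
NAME_DIR=/home/hadoop/data/nameNode

# A formatted NameNode directory contains current/VERSION.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

if namenode_is_formatted "$NAME_DIR"; then
  echo "NameNode directory already formatted; skipping format"
else
  echo "not formatted yet; run: hdfs namenode -format"
fi
```

The check only prints a message; it never formats anything itself.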
Your Hadoop installation is now configured and ready to run.
7 Run and monitor HDFS
This section will walk through starting HDFS on the NameNode and DataNodes, and monitoring that everything is properly working and interacting with HDFS data.
7.1 Start and Stop HDFS
Start HDFS by running the following script from node-master:
$ start-dfs.sh
It will start NameNode and SecondaryNameNode on hadoop1 (node-master), and DataNode on hadoop2, hadoop3 and hadoop4, according to the configuration in the slaves config file.
Check that every process is running with the jps command on each node. On node-master you should see the NameNode and SecondaryNameNode processes (along with Jps itself), and on each slave node a DataNode process. The PIDs will differ.
To stop HDFS on master and slave nodes, run the following command from node-master:
$ stop-dfs.sh
- You can also use the friendlier web user interface. Point your browser to http://hadoop1:50070 and you’ll get a user-friendly monitoring console.
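The jps check described above can be scripted so that a missing daemon stands out. The sketch below reads jps output (one "PID Name" pair per line) and prints every expected daemon that is absent. `missing_daemons` is a hypothetical helper, and the sample output uses made-up PIDs.

```shell
# Print each expected daemon name that is absent from jps output on stdin.
missing_daemons() {
  jps_output=$(cat)
  for daemon in "$@"; do
    # Compare against the second column exactly, so that NameNode
    # does not match SecondaryNameNode.
    printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon" \
      || echo "$daemon"
  done
}

# Hypothetical jps output for node-master; real PIDs will differ.
sample='21922 Jps
21603 NameNode'

printf '%s\n' "$sample" | missing_daemons NameNode SecondaryNameNode
# prints SecondaryNameNode
```

On the cluster this could be used as `jps | missing_daemons NameNode SecondaryNameNode` on node-master, or `ssh hadoop2 jps | missing_daemons DataNode` for a slave node, provided jps is on the PATH of the remote shell.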
7.2 Monitor your HDFS Cluster
You can get useful information about your running HDFS cluster with the hdfs dfsadmin command. Try for example:
$ hdfs dfsadmin -report
7.3 Create your home directory
First, manually create your home directory. All other commands will use a path relative to this default home directory:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hadoop
$ hdfs dfs -ls /
7.4 Put and Get Data to HDFS
Now we can start writing to and reading from HDFS. This is done with the command hdfs dfs.
Let’s use some textbooks from the Gutenberg project as an example.
Create a books directory in HDFS. The following command will create it in the home directory, as /user/hadoop/books:
$ hdfs dfs -mkdir books
$ hdfs dfs -ls -R /
Grab a few books from the Gutenberg project:
$ wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
$ wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
$ wget -O frankenstein.txt http://www.gutenberg.org/files/84/84-0.txt
Put the three books into HDFS, in the books directory:
$ hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
List the contents of the books directory:
$ hdfs dfs -ls books
You can also directly print the books from HDFS:
$ hdfs dfs -cat books/alice.txt
There are many commands to manage your
HDFS. For a complete list, you can look at the
Apache HDFS shell
documentation, or print help with:
$ hdfs dfs -help
7.5 Auto Start-Stop Hadoop using systemd
To start and stop Hadoop when the system boots or shuts down, you can create a systemd service unit at /etc/systemd/system/hadoop.service:
[Unit]
Description=Hadoop DFS namenode and datanode
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target

[Service]
User=hadoop
Group=hadoop
Type=forking
ExecStart=/home/hadoop/hadoop/sbin/start-dfs.sh
ExecStop=/home/hadoop/hadoop/sbin/stop-dfs.sh
WorkingDirectory=/home/hadoop/
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0
Environment=HADOOP_HOME=/home/hadoop/hadoop
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/tmp/hadoop-hadoop-namenode.pid

[Install]
WantedBy=multi-user.target
Now that we have our systemd unit, reload systemd so that it picks up the new file, then start the Hadoop daemons:
systemctl daemon-reload # Re-read unit files
systemctl start hadoop.service # Start service
To run Hadoop daemons at startup:
systemctl enable hadoop.service
Finally, to stop your daemons:
systemctl stop hadoop.service
7.6 Exit Hadoop safe mode
When Hadoop is stopped improperly, or in other unusual circumstances, HDFS can enter safe mode. Attempts to use it then throw exceptions similar to this one:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /tmp/hive/hadoop/15f0a0a4-74b9-40db-b8f9-0ddc1d7de9dd. Name node is in safe mode.
To exit safe mode and resume normal Hadoop operation, you can use this administration command:
hdfs dfsadmin -safemode leave
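The NameNode normally leaves safe mode on its own once enough DataNodes have reported their blocks, so it is reasonable to check the current state before forcing it off. The sketch below assumes that `hdfs dfsadmin -safemode get` reports a line containing "Safe mode is ON" or "Safe mode is OFF"; the helper name `safemode_is_on` is hypothetical and the status string here is a stand-in for real command output.

```shell
# Succeed if the given `-safemode get` output reports safe mode as on.
safemode_is_on() {
  case "$1" in
    *"Safe mode is ON"*) return 0 ;;
    *) return 1 ;;
  esac
}

# On the cluster: status=$(hdfs dfsadmin -safemode get)
status="Safe mode is ON"   # assumed sample output for illustration

if safemode_is_on "$status"; then
  echo "would run: hdfs dfsadmin -safemode leave"
else
  echo "safe mode already off; nothing to do"
fi
```

The sketch only prints what it would do, leaving the decision to force the NameNode out of safe mode to the administrator.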
8 Run YARN
HDFS is a distributed storage system; it doesn’t provide any services for running and scheduling tasks in the cluster.
This is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN.
8.1 Start and Stop YARN
Start YARN with the script:
$ start-yarn.sh
Check that everything is running with the jps command.
In addition to the previous HDFS daemons, you should see a ResourceManager on hadoop1 (node-master), and a NodeManager on hadoop2, hadoop3 and hadoop4.
To stop YARN, run the following command on hadoop1 (node-master):
$ stop-yarn.sh
8.2 Add the Job History server
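Start the MapReduce JobHistory Server with the following command, run on the designated server, which in our case is the first node for simplicity. The Hadoop documentation runs it as the mapred user; in this guide the hadoop user runs all Hadoop daemons.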
$ mr-jobhistory-daemon.sh start historyserver
8.3 Monitor YARN
The yarn command provides utilities to manage your YARN cluster. You can print a report of running nodes with the command:
$ yarn node -list
YARN provides a friendlier web UI, started by default on port 8088 of the Resource Manager.
Point your browser to http://hadoop1:8088 and browse the UI.
8.4 Submit MapReduce Jobs to YARN
YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar.
The Hadoop installation package provides sample applications that can be run to test your cluster.
You’ll use them to run a word count on the three books previously uploaded to HDFS.
Submit a job with the sample jar to YARN. On hadoop1 (node-master), run:
$ hdfs dfs -rm -r output
$ yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount "books/*" output
The first command removes any output directory left by a previous run; it reports an error if none exists, which can be ignored. The last argument of the second command is where the output of the job will be saved in HDFS. In this example, it is a directory named "output".
After the job is finished, you can list the result by querying HDFS:
$ hdfs dfs -ls output
On success, the listing contains an empty _SUCCESS marker file and the result file part-r-00000.
Print the result with:
$ hdfs dfs -cat output/part-r-00000
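The raw result lists every distinct word with its count, which is long for three books. A small filter can show just the most frequent words. The sketch below assumes each line of the result has the form "word<TAB>count"; `top_words` is a hypothetical helper and the sample input stands in for output/part-r-00000.

```shell
# Print the N most frequent words from wordcount output on stdin.
# Each input line is assumed to be "word<TAB>count".
top_words() {
  sort -t "$(printf '\t')" -k2,2nr | head -n "$1"
}

# Hypothetical sample standing in for output/part-r-00000.
printf 'alice\t12\nthe\t40\nrabbit\t7\n' | top_words 2
```

Against the real job output this could be run as `hdfs dfs -cat output/part-r-00000 | top_words 10`.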