Let's see how to configure a single Hadoop instance (primarily on macOS) for test purposes. The idea is to get familiar with Hadoop concepts before we move on to a cluster configuration.
1 Installation on Linux
2 Installation on Mac OS X
2.1 Install brew
Homebrew is a free and open-source software package management system that simplifies the installation of software on Apple's macOS operating system.
$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
==> This script will install:
/usr/local/bin/brew
/usr/local/share/doc/homebrew
/usr/local/share/man/man1/brew.1
/usr/local/share/zsh/site-functions/_brew
/usr/local/etc/bash_completion.d/brew
/usr/local/Homebrew
...
2.2 Install Java
Ensure you have Java installed.
$ java -version
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
Check the Java location:
$ echo $(/usr/libexec/java_home)
/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
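If JAVA_HOME is not already set, you can point it at this location from your shell profile (a minimal sketch, assuming a bash shell):
$ export JAVA_HOME=$(/usr/libexec/java_home)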
2.3 Configure ssh
You need to enable SSH on the Hadoop machine. Ensure SSH is enabled:
$ sudo systemsetup -getremotelogin
Remote Login: On
When Hadoop is installed in distributed mode, it uses passwordless SSH for master-to-slave communication. To enable the SSH daemon on macOS, go to System Preferences => Sharing and click on Remote Login. Then execute the following commands in the terminal to enable passwordless SSH login:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
Now you can run a simple command without being prompted for a password:
$ ssh localhost ls
Applications
Desktop
Documents
Downloads
...
2.4 Download using brew
$ brew install hadoop
==> Downloading https://www.apache.org/dyn/closer.cgi?path=hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
==> Best Mirror http://apache.rediris.es/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
######################################################################## 100.0%
==> Caveats
In Hadoop's config file:
/usr/local/opt/hadoop/libexec/etc/hadoop/hadoop-env.sh,
/usr/local/opt/hadoop/libexec/etc/hadoop/mapred-env.sh and
/usr/local/opt/hadoop/libexec/etc/hadoop/yarn-env.sh
$JAVA_HOME has been set to be the output of:
/usr/libexec/java_home
As you can see from the brew logs, Hadoop is installed under /usr/local/Cellar/hadoop/2.8.2 and symlinked from /usr/local/opt/hadoop.
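To double-check the installation, you can ask Hadoop itself for its version (assuming brew has linked the hadoop binary onto your PATH):
$ hadoop version
Hadoop 2.8.2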
3 Configure Hadoop
Modify the following Hadoop configuration files to properly set up Hadoop and YARN. These files are located under the /usr/local/opt/hadoop/libexec/etc/hadoop directory.
/usr/local/opt/hadoop/libexec/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
/usr/local/opt/hadoop/libexec/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
/usr/local/opt/hadoop/libexec/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
/usr/local/opt/hadoop/libexec/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>85.0</value>
  </property>
</configuration>
Note the disk utilization threshold above: it tells YARN to continue operations while disk utilization stays below 85.0%. The default value is 90%. If disk utilization rises above the configured threshold, YARN will report the node as unhealthy with the error "local-dirs are bad".
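Once YARN is running (section 7), a quick way to check node health from the terminal, including any nodes reported unhealthy, is:
$ yarn node -list -all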
4 Initialize Hadoop Cluster
From a terminal window, switch to the Hadoop home folder (the folder which contains various subfolders such as bin and etc).
$ cd /usr/local/opt/hadoop
$ ls -l
-rw-r--r-- 1 deister admin 578 9 feb 21:43 INSTALL_RECEIPT.json
-rw-r--r-- 1 deister admin 99253 19 oct 23:11 LICENSE.txt
-rw-r--r-- 1 deister admin 15915 19 oct 23:11 NOTICE.txt
-rw-r--r-- 1 deister admin 1366 19 oct 23:11 README.txt
drwxr-xr-x 8 deister admin 272 9 feb 21:43 bin
drwxr-xr-x 8 deister admin 272 9 feb 21:53 libexec
drwxr-xr-x 23 deister admin 782 9 feb 21:43 sbin
Run the following command to initialize the metadata for the Hadoop cluster. This formats the HDFS file system and configures it on the local system, by default under the /tmp/hadoop-<username> folder. It is possible to change the default location of the name node metadata by editing the hdfs-site.xml file; similarly, the HDFS data block storage location can be changed through the dfs.data.dir property.
$ hdfs namenode -format
18/02/09 21:51:44 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: user = deister
STARTUP_MSG: host = imac-test-local/192.168.1.69
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.8.2
STARTUP_MSG: classpath = /usr/local/Cellar/hadoop/2.8.2/libexec/etc/hadoop: ....
...
STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb; compiled by 'jdu' on 2017-10-19T20:39Z
STARTUP_MSG: java = 1.8.0_91
...
18/02/09 21:51:47 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/02/09 21:51:47 INFO util.ExitUtil: Exiting with status 0
18/02/09 21:51:47 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at imac-test.local/192.168.1.69
************************************************************/
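With the default configuration, you can verify that the metadata was written to the local filesystem (the path below assumes the default hadoop.tmp.dir of /tmp/hadoop-<username>):
$ ls /tmp/hadoop-`whoami`/dfs/name/current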
5 Start Hadoop Cluster
Run the following command from the terminal (after switching to the Hadoop home folder) to start the Hadoop cluster. This starts the name node and data node on the local system.
$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-deister-namenode-iMac-test.local.out
localhost: starting datanode, logging to /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-deister-datanode-iMac-test.local.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-deister-secondarynamenode-iMac-test.out
To verify that the namenode and datanode daemons are running, execute the following command on the terminal. It displays the running Java processes on the system.
$ jps
19203 DataNode
29219 Jps
19126 NameNode
19303 SecondaryNameNode
We see a datanode and a namenode on the same server because we are deploying a single-node Hadoop. When running on a cluster, the namenode host will not normally contain a datanode. If the namenode or datanode does not start, review the log files written during start-dfs.sh.
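For example, to inspect the namenode log (the log directory matches the paths printed by start-dfs.sh above; each daemon writes its main log with a .log extension alongside the .out file):
$ tail -n 50 /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-`whoami`-namenode-*.log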
6 Working with HDFS
You cannot directly browse HDFS from the terminal using cat or similar commands: HDFS is a logical file system and does not map directly onto the Unix file system. To browse it you need an HDFS client and a running Hadoop cluster; when you browse HDFS, you get the directory structure from the namenode and the actual data from the datanodes. Although you cannot browse it directly, the data is there, stored by the datanode daemon. By default, HDFS uses a series of default values as specified in hdfs-default:
Name | Value | Description
---|---|---
dfs.name.dir | ${hadoop.tmp.dir}/dfs/name | Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir | ${hadoop.tmp.dir}/dfs/data | Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
dfs.replication | 3 | Default block replication. The actual number of replicas can be specified when the file is created. The default is used if replication is not specified at create time.
As our user name is deister in this example, we will find the data under /tmp/hadoop-deister/dfs.
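You can query the value that the running configuration resolves for any of these keys with hdfs getconf, for example:
$ hdfs getconf -confKey dfs.replication
1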
To change the storage location to an appropriate filesystem, edit the file /usr/local/opt/hadoop/libexec/etc/hadoop/hdfs-site.xml and set the dfs.name.dir and dfs.data.dir properties accordingly.
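For example (a sketch; the paths below are placeholders to replace with directories on your own disks):
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hdfs/data</value>
  </property>
</configuration>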
Every time you change these properties you should reformat HDFS by running:
$ hdfs namenode -format
6.1 Configure HDFS Home Directories
We will now configure the HDFS home directories. The home directory is of the form /user/<username>, so you need to create two directories:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/`whoami`
6.2 Create a directory
$ hdfs dfs -mkdir test
6.3 List directories
List the root directory:
$ hdfs dfs -ls /
drwxr-xr-x - deister supergroup 0 2018-02-10 23:28 /user
List the root directory recursively:
$ hdfs dfs -ls -R /
drwxr-xr-x - deister supergroup 0 2018-02-10 23:28 /user
drwxr-xr-x - deister supergroup 0 2018-02-10 23:29 /user/deister
drwxr-xr-x - deister supergroup 0 2018-02-10 23:29 /user/deister/test
6.4 Remove a directory
$ hdfs dfs -rm -r test
6.5 Copy a file
Now we can try to copy a file from our local directory to our HDFS home under /user/<username>:
$ echo "Hello World" > sample.txt
$ hdfs dfs -copyFromLocal sample.txt .
6.6 Cat a file
$ hdfs dfs -cat sample.txt
Hello World
6.7 fsck
$ hdfs fsck /
FSCK started by deister (auth:SIMPLE) from /127.0.0.1 for path / at Fri Feb 09 23:14:04 CET 2018
..Status: HEALTHY
Total size: 31830 B
Total dirs: 4
Total files: 2
Total symlinks: 0
Total blocks (validated): 2 (avg. block size 15915 B)
Minimally replicated blocks: 2 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Fri Feb 09 23:14:04 CET 2018 in 1 milliseconds
The filesystem under path '/' is HEALTHY
$ hdfs fsck / -files -blocks
....
7 Run YARN Manager
Start the YARN resource manager and node manager instances by running the following command on the terminal:
$ sbin/start-yarn.sh
Run the jps command again to verify all the running processes:
19203 DataNode
29283 Jps
19413 ResourceManager
19126 NameNode
19303 SecondaryNameNode
19497 NodeManager
The presence of ResourceManager signals that YARN is running.
8 Verify Hadoop
8.1 Verify name node
Access the URL:
- Prior to Hadoop 3: http://localhost:50070/dfshealth.html
- Since Hadoop 3.0.0-alpha1: http://localhost:9870/dfshealth.html
8.2 Verify YARN
Access the URL http://localhost:8088/ to view the Hadoop cluster details through the YARN resource manager.
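You can also query the resource manager from the terminal through its REST API (the cluster-info endpoint below should return a JSON document describing the cluster):
$ curl -s http://localhost:8088/ws/v1/cluster/info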
9 Run Sample MapReduce Job
The Hadoop installation contains a number of sample mapreduce jobs. We will run one of them to verify that our installation is working fine.
- We will first generate a file on the local system, which we will then copy to the HDFS home folder:
$ cat > /tmp/words.txt
hello bye mark mary mark alfred
^D
- Copy the file to your HDFS home:
$ hdfs dfs -copyFromLocal /tmp/words.txt .
- Change to the Hadoop program directory:
$ cd /usr/local/opt/hadoop
- Let us run a mapreduce program on this HDFS file words.txt to find the number of occurrences of the word "mark" in the file. The results will be placed in the HDFS folder output. A suitable program (the grep example, which counts occurrences of a pattern) is available in the Hadoop samples:
$ hadoop jar ./libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep words.txt output 'mark'
18/02/10 00:01:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/10 00:01:09 INFO input.FileInputFormat: Total input files to process : 1
18/02/10 00:01:09 INFO mapreduce.JobSubmitter: number of splits:1
18/02/10 00:01:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1518216494723_0003
18/02/10 00:01:10 INFO impl.YarnClientImpl: Submitted application application_1518216494723_0003
18/02/10 00:01:10 INFO mapreduce.Job: The url to track the job: http://iMac-test.local:8088/proxy/application_1518216494723_0003/
18/02/10 00:01:10 INFO mapreduce.Job: Running job: job_1518216494723_0003
18/02/10 00:01:19 INFO mapreduce.Job: Job job_1518216494723_0003 running in uber mode : false
18/02/10 00:01:19 INFO mapreduce.Job: map 0% reduce 0%
18/02/10 00:01:25 INFO mapreduce.Job: map 100% reduce 0%
...
This runs the mapreduce job on the HDFS file uploaded earlier and writes the results to the output folder inside the HDFS home folder. The result file is named part-r-00000; it can be downloaded from the name node browser console, or you can run the following commands to copy it to a local folder.
- Now, change to a working directory and download the HDFS output folder to look into the results:
$ cd /tmp
$ hdfs dfs -get output .
$ cd output
$ ls -l
total 8
-rw-r--r-- 1 deister wheel 0 10 feb 00:14 _SUCCESS
-rw-r--r-- 1 deister wheel 7 10 feb 00:14 part-r-00000
- Finally, check the results of the job by searching for the word mark:
$ cat part-r-00000
2 mark
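Note that mapreduce jobs refuse to overwrite an existing output folder, so if you want to re-run the example, remove the output folder first:
$ hdfs dfs -rm -r output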
10 Stop Hadoop/YARN Cluster
Run the following commands to stop the Hadoop/YARN daemons. This stops the name node, data node, node manager and resource manager.
$ cd /usr/local/opt/hadoop
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh
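You can confirm that everything is down by running jps again; only the Jps process itself should remain in the list:
$ jps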