Let's see how to configure a single Hadoop instance (primarily on macOS) for test purposes. The idea is to get familiar with Hadoop concepts before we move on to a cluster configuration.

1 Installation on Linux

2 Installation on Mac OS X

2.1 Install brew

Homebrew is a free and open-source software package management system that simplifies the installation of software on Apple's macOS operating system.

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
==> This script will install:
/usr/local/bin/brew
/usr/local/share/doc/homebrew
/usr/local/share/man/man1/brew.1
/usr/local/share/zsh/site-functions/_brew
/usr/local/etc/bash_completion.d/brew
/usr/local/Homebrew
...

2.2 Install Java

Ensure you have Java installed.

$ java -version
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

Check the Java location:

$ echo $(/usr/libexec/java_home)
/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home
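
If other tools also need JAVA_HOME in your shell environment, you can export it in your profile. This is an optional convenience (the brew-installed Hadoop scripts resolve JAVA_HOME themselves via /usr/libexec/java_home, as the brew caveats below show):

$ echo 'export JAVA_HOME="$(/usr/libexec/java_home)"' >> ~/.bash_profile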

2.3 Configure ssh

You need to enable SSH on the Hadoop machine. Ensure SSH is enabled:

$ sudo systemsetup -getremotelogin
Remote Login: On

When Hadoop is installed in distributed mode, it uses passwordless SSH for master-to-slave communication. To enable the SSH daemon on macOS, go to System Preferences => Sharing and click on Remote Login. Then execute the following commands in the terminal to enable passwordless SSH login (we generate an RSA key; the DSA key type used in some older guides is rejected by modern OpenSSH):

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

Now you can run a simple command without entering a password:

$ ssh localhost ls
Applications
Desktop
Documents
Downloads
...

2.4 Download using brew

$ brew install hadoop
==> Downloading https://www.apache.org/dyn/closer.cgi?path=hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
==> Best Mirror http://apache.rediris.es/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
######################################################################## 100.0%
==> Caveats
In Hadoop's config file:
  /usr/local/opt/hadoop/libexec/etc/hadoop/hadoop-env.sh,
  /usr/local/opt/hadoop/libexec/etc/hadoop/mapred-env.sh and
  /usr/local/opt/hadoop/libexec/etc/hadoop/yarn-env.sh
$JAVA_HOME has been set to be the output of:
  /usr/libexec/java_home

As you can see from the brew logs, Hadoop is installed under /usr/local/Cellar/hadoop/2.8.2 and linked from /usr/local/opt/hadoop.
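
Before touching any configuration, a quick sanity check: the hadoop version command should report the release brew just installed (2.8.2 at the time of writing).

$ hadoop version
Hadoop 2.8.2
...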

3 Configure Hadoop

Modify various Hadoop configuration files to properly set up Hadoop and YARN. These files are located under the /usr/local/opt/hadoop/libexec/etc/hadoop directory.


/usr/local/opt/hadoop/libexec/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

/usr/local/opt/hadoop/libexec/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

/usr/local/opt/hadoop/libexec/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

/usr/local/opt/hadoop/libexec/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>85.0</value>
    </property>
</configuration>

Note the use of the disk utilization threshold above. It tells YARN to continue operations only while disk utilization stays below 85.0%; the default value is 90%. If disk utilization goes above the configured threshold, YARN reports the node as unhealthy with the error "local-dirs are bad".
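
Before tuning this threshold, it helps to know the actual utilization of the volume that will hold HDFS data (the boot volume here, since the default hadoop.tmp.dir lives under /tmp). Compare the Capacity column of df against the configured percentage:

$ df -h /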

4 Initialize Hadoop Cluster

From a terminal window, switch to the Hadoop home folder (the folder that contains sub-folders such as bin and sbin).

$ cd /usr/local/opt/hadoop
$ ls -l
-rw-r--r--   1 deister  admin    578  9 feb 21:43 INSTALL_RECEIPT.json
-rw-r--r--   1 deister  admin  99253 19 oct 23:11 LICENSE.txt
-rw-r--r--   1 deister  admin  15915 19 oct 23:11 NOTICE.txt
-rw-r--r--   1 deister  admin   1366 19 oct 23:11 README.txt
drwxr-xr-x   8 deister  admin    272  9 feb 21:43 bin
drwxr-xr-x   8 deister  admin    272  9 feb 21:53 libexec
drwxr-xr-x  23 deister  admin    782  9 feb 21:43 sbin

Run the following command to initialize the metadata for the Hadoop cluster. This formats the HDFS file system and configures it on the local system.

By default, files are created under the /tmp/hadoop-<username> folder. The name node storage location can be changed by setting the dfs.name.dir property in the hdfs-site.xml file. Similarly, the HDFS data block storage location can be changed using the dfs.data.dir property.
$ hdfs namenode -format
18/02/09 21:51:44 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = deister
STARTUP_MSG:   host = imac-test-local/192.168.1.69
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.8.2
STARTUP_MSG:   classpath = /usr/local/Cellar/hadoop/2.8.2/libexec/etc/hadoop: ....    
...
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 66c47f2a01ad9637879e95f80c41f798373828fb; compiled by 'jdu' on 2017-10-19T20:39Z
STARTUP_MSG:   java = 1.8.0_91
...

18/02/09 21:51:47 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/02/09 21:51:47 INFO util.ExitUtil: Exiting with status 0
18/02/09 21:51:47 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at imac-test.local/192.168.1.69
************************************************************/

5 Start Hadoop Cluster

Run the following command from the terminal (after switching to the Hadoop home folder) to start the Hadoop cluster. This starts the name node and data node on the local system.

$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-deister-namenode-iMac-test.local.out
localhost: starting datanode, logging to /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-deister-datanode-iMac-test.local.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/Cellar/hadoop/2.8.2/libexec/logs/hadoop-deister-secondarynamenode-iMac-test.out

To verify that the namenode and datanode daemons are running, execute the following command on the terminal. This displays running Java processes on the system.

$ jps
19203 DataNode
29219 Jps
19126 NameNode
19303 SecondaryNameNode

We see a DataNode and a NameNode on the same server because we are deploying single-node Hadoop. When running on a cluster, the NameNode host will not normally run a DataNode.

If the NameNode or DataNode has not started, review the log files produced during start-dfs.sh.
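
The .out files named in the start-dfs.sh output sit in the log directory; the matching .log files usually contain the useful stack traces. A quick way to inspect them (the exact file names depend on your username and hostname):

$ tail -n 50 /usr/local/Cellar/hadoop/2.8.2/libexec/logs/*namenode*.log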

6 Working with HDFS

You cannot directly browse HDFS from the terminal using cat or similar commands. HDFS is a logical file system and does not map directly onto the Unix file system. You need an HDFS client, and your Hadoop cluster must be running. When you browse HDFS, you get the directory structure from the name node and the actual data from the data nodes.

Although you cannot browse it directly, the data is there, stored by the DataNode daemon. By default, HDFS uses a series of default values as specified in hdfs-default.xml:

dfs.name.dir (default: ${hadoop.tmp.dir}/dfs/name)
  Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, the name table is replicated in all of the directories, for redundancy.

dfs.data.dir (default: ${hadoop.tmp.dir}/dfs/data)
  Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.

dfs.replication (default: 3)
  Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.

As our user name is deister in this example, we will find the data under /tmp/hadoop-deister/dfs. To change the storage location to an appropriate filesystem, edit the file:

/usr/local/opt/hadoop/libexec/etc/hadoop/hdfs-site.xml

and set the dfs.name.dir and dfs.data.dir properties accordingly.
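
For example, to move both locations onto a dedicated volume (the /Volumes/hadoop-data path below is purely illustrative; in Hadoop 2.x these property names are the deprecated aliases of dfs.namenode.name.dir and dfs.datanode.data.dir, and both spellings work):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/Volumes/hadoop-data/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/Volumes/hadoop-data/dfs/data</value>
    </property>
</configuration>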

Every time you change those properties you should reformat HDFS by doing:

$ hdfs namenode -format
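
If you keep the defaults instead, you can check where the blocks actually live once the daemons have written data (the path assumes the user deister as in this example); a block-pool directory (BP-...) and a VERSION file should show up there:

$ ls /tmp/hadoop-deister/dfs/data/current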

6.1 Configure HDFS Home Directories

We will now configure the HDFS home directories. The home directory is of the form /user/<username>, so you need to create two directories.

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/`whoami`
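
Both commands can be collapsed into one, since hdfs dfs -mkdir supports the same -p flag as the Unix mkdir:

$ hdfs dfs -mkdir -p /user/`whoami`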

6.2 Create a directory

$ hdfs dfs -mkdir test

6.3 List directories

List the root directory:

$ hdfs dfs -ls /
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:28 /user

List the root directory recursively:

$ hdfs dfs -ls -R /
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:28 /user
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:29 /user/deister
drwxr-xr-x   - deister supergroup          0 2018-02-10 23:29 /user/deister/test

6.4 Remove a directory

$ hdfs dfs -rm -r test

6.5 Copy a file

Now, we can try to copy a file from our local directory to our HDFS home under /user/<username>.

$ echo "Hello World" > sample.txt
$ hdfs dfs -copyFromLocal sample.txt .

6.6 Cat a file

$ hdfs dfs -cat sample.txt
Hello World

6.7 fsck

$ hdfs fsck /
FSCK started by deister (auth:SIMPLE) from /127.0.0.1 for path / at Fri Feb 09 23:14:04 CET 2018
..Status: HEALTHY
 Total size:	31830 B
 Total dirs:	4
 Total files:	2
 Total symlinks:		0
 Total blocks (validated):	2 (avg. block size 15915 B)
 Minimally replicated blocks:	2 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	1
 Average block replication:	1.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		1
 Number of racks:		1
FSCK ended at Fri Feb 09 23:14:04 CET 2018 in 1 milliseconds


The filesystem under path '/' is HEALTHY
$ hdfs fsck / -files -blocks
....

7 Run YARN Manager

Start YARN resource manager and node manager instances by running the following command on the terminal:

$ sbin/start-yarn.sh

Run the jps command again to verify all the running processes:

$ jps
19203 DataNode
29283 Jps 
19413 ResourceManager 
19126 NameNode 
19303 SecondaryNameNode 
19497 NodeManager

The presence of ResourceManager signals that YARN is running.
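
You can also ask the resource manager about node health from the terminal, including the disk-health status we configured earlier (the reported node name and port depend on your host):

$ yarn node -list -all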

8 Verify Hadoop

8.1 Verify name node

Access the URL http://localhost:50070 to view the Hadoop name node configuration. You can also navigate the HDFS file system using the menu Utilities > Browse the file system.
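
The same information is available from the terminal; dfsadmin prints the cluster capacity and per-datanode details:

$ hdfs dfsadmin -report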

8.2 Verify YARN

Access the URL http://localhost:8088 to view the Hadoop cluster details through the YARN resource manager.
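
The terminal equivalent is listing the applications known to the resource manager (the list stays empty until we submit a job in the next section):

$ yarn application -list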

9 Run Sample MapReduce Job

The Hadoop installation contains a number of sample MapReduce jobs. We will run one of them to verify that our Hadoop installation is working fine.

  1. We will first generate a file on the local system (^D ends the cat input).
    $ cat > /tmp/words.txt
    hello
    bye
    mark
    mary
    mark
    alfred
    ^D
  2. Copy the file to your HDFS home folder.
    $ hdfs dfs -copyFromLocal /tmp/words.txt .
  3. Change to the Hadoop program directory.
    $ cd /usr/local/opt/hadoop
  4. Let us run a MapReduce program on the HDFS file words.txt to find the number of occurrences of the word "mark" in the file. The results will be placed in the HDFS folder output.
    A MapReduce grep example, which counts matches of a regular expression, is available in the Hadoop samples.
    $ hadoop jar ./libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep words.txt output 'mark'
    18/02/10 00:01:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    18/02/10 00:01:09 INFO input.FileInputFormat: Total input files to process : 1
    18/02/10 00:01:09 INFO mapreduce.JobSubmitter: number of splits:1
    18/02/10 00:01:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1518216494723_0003
    18/02/10 00:01:10 INFO impl.YarnClientImpl: Submitted application application_1518216494723_0003
    18/02/10 00:01:10 INFO mapreduce.Job: The url to track the job: http://iMac-test.local:8088/proxy/application_1518216494723_0003/
    18/02/10 00:01:10 INFO mapreduce.Job: Running job: job_1518216494723_0003
    18/02/10 00:01:19 INFO mapreduce.Job: Job job_1518216494723_0003 running in uber mode : false
    18/02/10 00:01:19 INFO mapreduce.Job:  map 0% reduce 0%
    18/02/10 00:01:25 INFO mapreduce.Job:  map 100% reduce 0%
    ...

    This runs the MapReduce job on the HDFS file uploaded earlier and writes the results to the output folder inside the HDFS home folder. The result file is named part-r-00000. It can be downloaded from the name node browser console, or copied to a local folder with the command below (a way to read it directly from HDFS is shown after this list).

  5. Now, change to a working directory and download the HDFS output folder to inspect the results.
    $ cd /tmp
    $ hdfs dfs -get output/* .
    $ cd output
    $ ls -l
    total 8
    -rw-r--r--  1 deister  wheel  0 10 feb 00:14 _SUCCESS
    -rw-r--r--  1 deister  wheel  7 10 feb 00:14 part-r-00000
  6. Finally, check the results of the grep job searching for the word mark.
    $ cat part-r-00000
    2	mark
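
Alternatively, you can read the result directly from HDFS without downloading it:

$ hdfs dfs -cat output/part-r-00000
2	mark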

10 Stop Hadoop/YARN Cluster

Run the following commands to stop the Hadoop/YARN daemons. This stops the name node, data node, node manager, and resource manager.

$ cd /usr/local/opt/hadoop
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh