This document describes a test of the TPCH benchmark at low scale factors (10, 20, 50 and 100) using Hadoop on a single NUC computer, plus a final test at scale factor 1000 on a cluster of 10 NUC computers (10 times scale 100).

You can read more about the Hadoop TPCH test in the following documents:

1 Install Hadoop

1.1 Install OpenJDK 8 JDK

To install OpenJDK 8 JDK using yum, run this command as root:

# yum install java-1.8.0-openjdk-devel

After the installation completes, check the Java version:

# java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

1.2 Download Hadoop release

With Java in place, visit the Apache Hadoop Releases page to find the most recent stable release.

Click on the binary link for the desired release. You will be redirected to the download page.

You can use wget on a given release URL to download the archive directly.

# cd /home    
# wget http://apache.uvigo.es/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
--2018-11-22 14:22:13--  http://apache.uvigo.es/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
Resolving apache.uvigo.es (apache.uvigo.es)... 193.146.32.74, 2001:720:1214:4200::74
Connecting to apache.uvigo.es (apache.uvigo.es)|193.146.32.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 314322972 (300M) [application/x-gzip]
Saving to: ‘hadoop-3.0.3.tar.gz’

100%[=======================================================================>] 314,322,972 7.46MB/s   in 41s    

2018-11-22 14:22:54 (7.38 MB/s) - ‘hadoop-3.0.3.tar.gz’ saved [314322972/314322972]

1.3 Extract files from archive

Extract Hadoop from the archive. This creates a directory named after the Hadoop version.

# cd /home    
# tar xvzf hadoop-3.0.3.tar.gz
hadoop-3.0.3/
hadoop-3.0.3/LICENSE.txt
hadoop-3.0.3/NOTICE.txt
hadoop-3.0.3/README.txt
hadoop-3.0.3/bin/
hadoop-3.0.3/bin/hadoop
...

1.4 Create a hadoop user account

As root, create the hadoop user account.

# useradd hadoop

Assuming Hadoop was extracted under /home, we can create a link for the hadoop user:

# cd /home    
# ln -s /home/hadoop-3.0.3 /home/hadoop

Now, ensure all files are owned by the hadoop user.

# cd /home    
# chown -R hadoop:hadoop hadoop-3.0.3

1.5 Login as hadoop

Ensure everything is correct by logging in as hadoop.

# su - hadoop    
hadoop@nuc10 $ ls -l
total 176
drwxr-xr-x 2 hadoop hadoop    183 May 31 19:36 bin
drwxr-xr-x 3 hadoop hadoop     20 May 31 19:13 etc
drwxr-xr-x 2 hadoop hadoop    106 May 31 19:36 include
drwxr-xr-x 3 hadoop hadoop     20 May 31 19:36 lib
drwxr-xr-x 4 hadoop hadoop    288 May 31 19:36 libexec
-rw-rw-r-- 1 hadoop hadoop 147066 May 29 19:58 LICENSE.txt
-rw-rw-r-- 1 hadoop hadoop  20891 May 29 19:58 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop   1366 May 26 20:43 README.txt
drwxr-xr-x 3 hadoop hadoop   4096 May 31 19:13 sbin
drwxr-xr-x 4 hadoop hadoop     31 May 31 19:50 share

Create a .bash_profile in /home/hadoop to set up the bash environment if needed.

$ cat > .bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
	. ~/.bashrc
fi
^D

1.6 Configuring Hadoop's Java Home

Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then we'll use sed to trim bin/java from the output to give us the correct value for JAVA_HOME.

To find the default Java path:

readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre/

You can copy this output to set Hadoop's Java home to this specific version, which ensures that if the default Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so that Hadoop will automatically use whatever Java version is set as the system default.

To begin, open hadoop-env.sh:

$ vi /home/hadoop/etc/hadoop/hadoop-env.sh

1.6.1 Option 1: Set a Static Value

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre/

1.6.2 Option 2: Use Readlink to Set the Value Dynamically

#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

1.7 Verify Hadoop installation

Now we should be able to run Hadoop. Log in as hadoop and run:

$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

...

Seeing this help output means we've successfully configured Hadoop to run in stand-alone mode.
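
As an extra quick check, you can print the version reported by the freshly unpacked binaries; it should match the release you downloaded:

$ bin/hadoop version
Hadoop 3.0.3
...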

2 Setup Hadoop

Now that Hadoop is properly installed, we will proceed to configure it.

2.1 Configure ssh

You need to enable SSH on the Hadoop machine. Ensure SSH is enabled; on macOS you can check it as shown below (on Linux, make sure the sshd service is running).

$ sudo systemsetup -getremotelogin
Remote Login: On

When Hadoop is installed in distributed mode, it uses passwordless SSH for master-to-slave communication. To enable the SSH daemon on macOS, go to System Preferences => Sharing and click on Remote Login. Execute the following commands in the terminal to enable passwordless SSH login:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Now you can run a simple command without entering a password:

$ ssh localhost ls
Applications
Desktop
Documents
Downloads
...

2.2 Configure Hadoop

Modify the various Hadoop configuration files to properly set up Hadoop and YARN. These files are located under the etc/hadoop directory.

2.2.1 Shell environment variables


.bash_profile

export HADOOP_HOME=/home/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

2.2.2 Hadoop configuration files

In core-site.xml, configure the master node.

etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master-node:9000</value>
    </property>
</configuration>

For example:

etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nuc10:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml

<configuration>

  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/hdfs/name</value>
    <final>true</final>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/hdfs/data</value>
    <final>true</final>
  </property>

  <property>
     <name>dfs.replication</name>
     <value>1</value>
  </property>

</configuration>

etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
    </property>
    
    <!-- required to run mapreduce on master node -->
    <property> 
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
    </property>

</configuration>

In yarn-site.xml, set yarn.resourcemanager.hostname to the master node.

etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>85.0</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>nuc10</value>
    </property>    
    
</configuration>

Note the disk utilization threshold configured above. It tells YARN to keep operating as long as disk utilization stays below 85.0% (the default is 90%). If disk utilization rises above the configured threshold, YARN reports the node as unhealthy with the error "local-dirs are bad".

2.3 Change location of hadoop data

To change the storage location to an appropriate filesystem, edit the file:

etc/hadoop/hdfs-site.xml

and set the dfs.name.dir and dfs.data.dir properties accordingly.

<configuration>

  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/hdfs/name</value>
    <final>true</final>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/hdfs/data</value>
    <final>true</final>
  </property>

  <property>
     <name>dfs.replication</name>
     <value>3</value>
  </property>

</configuration>
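
These directories must exist on the local filesystem and be writable by the hadoop user before the NameNode is formatted; a minimal sketch as root, assuming the /data mount point used above:

# mkdir -p /data/hadoop/hdfs/name /data/hadoop/hdfs/data
# chown -R hadoop:hadoop /data/hadoop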

2.4 Configure Slaves

Execute the following commands on the master-node (nuc10).

The workers file is used by the startup scripts to start the required daemons on all nodes. Edit etc/hadoop/workers to be:

etc/hadoop/workers

nuc00
nuc01
nuc02
nuc03
nuc04
nuc05
nuc06
nuc07
nuc08
nuc09

2.5 Initialize Hadoop Cluster

Log in as hadoop and check the installation directory:

$ ls -l
total 176
drwxr-xr-x 2 hadoop hadoop    183 May 31 19:36 bin
drwxr-xr-x 3 hadoop hadoop     20 May 31 19:13 etc
drwxr-xr-x 2 hadoop hadoop    106 May 31 19:36 include
drwxr-xr-x 3 hadoop hadoop     20 May 31 19:36 lib
drwxr-xr-x 4 hadoop hadoop    288 May 31 19:36 libexec
-rw-rw-r-- 1 hadoop hadoop 147066 May 29 19:58 LICENSE.txt
drwxrwxr-x 2 hadoop hadoop     39 Nov 22 16:47 logs
-rw-rw-r-- 1 hadoop hadoop  20891 May 29 19:58 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop   1366 May 26 20:43 README.txt
drwxr-xr-x 3 hadoop hadoop   4096 May 31 19:13 sbin
drwxr-xr-x 4 hadoop hadoop     31 May 31 19:50 share

Run the following command to initialize the metadata for the Hadoop cluster. This formats the HDFS filesystem and configures it on the local system.

By default, files are created in the /tmp/hadoop-<username> folder. The NameNode metadata location can be changed in the hdfs-site.xml file (dfs.name.dir), and similarly the HDFS data block storage location can be changed using the dfs.data.dir property.

When running a cluster, also copy the contents of etc/hadoop from the master to the rest of the nodes.

$ hdfs namenode -format
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = nuc10/192.168.9.180
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.0.3
STARTUP_MSG:   classpath = /home/hadoop-3.0.3/etc/hadoop:/....
...
...
2018-11-22 16:47:03,930 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
2018-11-22 16:47:03,937 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2018-11-22 16:47:03,991 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2018-11-22 16:47:03,999 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-11-22 16:47:04,002 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at nuc10/192.168.9.180
************************************************************/

2.6 Start Hadoop Cluster

To configure passwordless SSH, you need to add the client machine's public key to the server machine's ~/.ssh/authorized_keys file. In this case, both systems are the same machine. First, logged in as the hadoop user, generate a private/public key pair:

$ su - hadoop
$ ssh-keygen

Finally, copy the public key to the authorized_keys file:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Check that the file permissions are "-rw-------"; if not, execute:

$ chmod 600 ~/.ssh/authorized_keys

Run the following command from the terminal (after switching to the Hadoop home folder) to start the Hadoop cluster. This starts the NameNode and DataNode on the local system.

$ sbin/start-all.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [nuc10]
Starting resourcemanager
Starting nodemanagers

To verify that the namenode and datanode daemons are running, execute the following command on the terminal. This displays running Java processes on the system.

$ jps
7897 DataNode
1922 Jps
726 NameNode
1400 ResourceManager
1115 SecondaryNameNode
1611 NodeManager

We should see both a DataNode and a NameNode on the same server, since we are deploying single-node Hadoop. When running on a cluster, the NameNode host will not normally run a DataNode.

2.6.1 Run YARN Manager

You can manually start YARN resource manager and node manager instances by running the following command on the terminal:

$ sbin/start-yarn.sh

Run the jps command again to verify all the running processes:

19203 DataNode
29283 Jps 
19413 ResourceManager 
19126 NameNode 
19303 SecondaryNameNode 
19497 NodeManager

The presence of ResourceManager signals that YARN is running. This command is not needed if you start Hadoop using start-all.sh.

2.7 Start history server

Additionally, to start the history server run:

$ bin/mapred --daemon start historyserver

You will see a "JobHistoryServer" process listed by jps. Now you can access the history server console, which by default listens on port 19888.
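
As a quick check that it is up, you can query its REST API (the /ws/v1/history/info endpoint of the standard JobHistory REST interface), assuming nuc10 is the host running it:

$ curl -s http://nuc10:19888/ws/v1/history/info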

2.8 Stop Hadoop cluster

Simply run:

$ sbin/stop-all.sh

2.9 Working with HDFS

You cannot directly browse HDFS from the terminal using cat or similar OS commands. HDFS is a logical file system and does not map directly onto the Unix file system. To browse it you need an HDFS client and a running Hadoop cluster; when you browse HDFS, you get the directory structure from the NameNode and the actual data from the DataNodes.

Although you cannot browse it directly, the data is there, stored by the DataNode daemon. By default HDFS uses the default values specified in hdfs-default.xml.

Name             Value                    Description
dfs.name.dir     /data/hadoop/hdfs/name   Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir     /data/hadoop/hdfs/data   Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
dfs.replication  3                        Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.

2.9.1 Configure HDFS Home Directories

We will now configure the HDFS home directory. The home directory is of the form /user/<username>, so you need to create two directories. First make sure the cluster is running:

$ sbin/start-all.sh
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/`whoami`

2.9.2 Create a directory

$ hdfs dfs -mkdir test

2.9.3 List directories

List the root directory:

$ hdfs dfs -ls /
drwxr-xr-x   - hadoop supergroup          0 2018-11-22 17:15 /user

List the root directory recursively:

$ hdfs dfs -ls -R /
drwxr-xr-x   - hadoop supergroup          0 2018-11-22 17:15 /user
drwxr-xr-x   - hadoop supergroup          0 2018-11-22 17:15 /user/hadoop
drwxr-xr-x   - hadoop supergroup          0 2018-11-22 17:15 /user/hadoop/test

The following example shows a Hadoop filesystem storing various TPCH Hive databases.

$ hdfs dfs -ls /user/hadoop/warehouse
Found 5 items
drwxr-xr-x   - hadoop supergroup          0 2018-11-26 20:06 /user/hadoop/warehouse/tpch_1.db
drwxr-xr-x   - hadoop supergroup          0 2018-11-27 16:42 /user/hadoop/warehouse/tpch_10.db
drwxr-xr-x   - hadoop supergroup          0 2018-11-23 16:50 /user/hadoop/warehouse/tpch_100.db
drwxr-xr-x   - hadoop supergroup          0 2018-11-26 20:19 /user/hadoop/warehouse/tpch_20.db
drwxr-xr-x   - hadoop supergroup          0 2018-11-26 20:33 /user/hadoop/warehouse/tpch_50.db

2.9.4 Remove a directory

$ hdfs dfs -rm -r test

2.9.5 Copy a file

Now, we can try to copy a file from our local directory to our HDFS home under /user/<username>:

$ echo "Hello World" > sample.txt
$ hdfs dfs -copyFromLocal sample.txt .

2.9.6 Cat a file

$ hdfs dfs -cat sample.txt
Hello World

2.9.7 fsck

$ hdfs fsck /
Connecting to namenode via http://localhost:9870/fsck?ugi=hadoop&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path / at Thu Nov 22 17:17:10 CET 2018

Status: HEALTHY
 Number of data-nodes:	1
 Number of racks:		1
 Total dirs:			3
 Total symlinks:		0

Replicated Blocks:
 Total size:	12 B
 Total files:	1
 Total blocks (validated):	1 (avg. block size 12 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	1
 Average block replication:	1.0
 Missing blocks:		0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)

Erasure Coded Block Groups:
 Total size:	0 B
 Total files:	0
 Total block groups (validated):	0
 Minimally erasure-coded block groups:	0
 Over-erasure-coded block groups:	0
 Under-erasure-coded block groups:	0
 Unsatisfactory placement block groups:	0
 Average block group size:	0.0
 Missing block groups:		0
 Corrupt block groups:		0
 Missing internal blocks:	0
FSCK ended at Thu Nov 22 17:17:10 CET 2018 in 4 milliseconds


The filesystem under path '/' is HEALTHY

$ hdfs fsck / -files -blocks
....

2.9.8 Expunge

To reclaim disk space by emptying the HDFS trash:

$ hdfs dfs -expunge

2.9.9 Disk usage

To see disk usage:

$ hdfs dfs -du -h
38.7 M  38.7 M  .hiveJars
12      12      sample.txt
92.8 G  92.8 G  warehouse
66      66      words.txt

2.9.10 Using HDFS cache

Centralized Cache Management in HDFS is a mechanism that explicitly caches specific files or directories in memory for improved performance. This is useful for relatively small files that are accessed repeatedly. For example, reference/lookup tables or fact tables that are used in many joins. Once enabled, HDFS will automatically cache selected files, and periodically check for changes and recache the files.

While HDFS and the underlying file system do some caching of files when memory is available, explicit caching using Centralized Cache Management prevents the data from being evicted from memory when processes consume all of the physical memory. As a corollary of this, if you ARE working on a lightly loaded system where there is free memory, you may not see any performance improvement from this method, as the data was already in disk cache. So, your performance testing needs to stress the system.

Let’s look at some key terms and concepts:

  • Cache pools: A cache pool is an administrative entity used to manage groups of cache directives. One of the key attributes of the pool is the maximum number of bytes that can be cached for all directives in this pool.
  • Cache directives: A cache directive defines a path that should be cached. This can be either a specific file or a single directory. Note that directives are not recursive: they apply to a single directory only, not any sub-directories. So, they would usually be applied to the lowest level directory that contains the actual data files.
  • HDFS Configuration Settings: There is really only one Hadoop configuration setting that is required to turn on Centralized Caching. There are a few others to control the frequency that caching looks for new files, which you can usually leave at default. The following, which is added to the custom hdfs-site.xml, specifies the maximum number of bytes that can be cached on each datanode.

    dfs.datanode.max.locked.memory

    Remember that this value is in bytes, in contrast with the OS limits which are set in KB.

  • OS Limits: Before you implement Centralized Caching, you need to ensure that the locked memory limit on each of the datanodes is set to a value equal to or greater than the memory specified in dfs.datanode.max.locked.memory, as described below.

Setting OS limits

On each datanode, run the following to determine the current limit for locked memory. This will return a value in KB or "unlimited".

ulimit -l
64

Set memlock limits on each datanode.

# On each datanode (max cacheable memory in KB) example for 4.0 GB
 
echo "* hard  memlock 4194304" >> /etc/security/limits.conf
echo "* soft  memlock 4194304" >> /etc/security/limits.conf

This will take effect after you log out and log in again.

ulimit -l
4194304

Setup hdfs-site.xml

Edit hdfs-site.xml and set up the HDFS node cache. For example, to set up a 2 GB cache:

<property>
    <name>dfs.datanode.max.locked.memory</name>
    <value>2147483648</value>
</property>
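
The directive added below references a cache pool named testPool. If it does not exist yet, create it first with hdfs cacheadmin -addPool; the limit shown is just an example matching the 2 GB cache configured above:

$ hdfs cacheadmin -addPool testPool -limit 2147483648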

Add a cache directive

$ hdfs cacheadmin -addDirective -path /user/hadoop/warehouse/tpch_10.db/orders -pool testPool  -ttl never
Added cache directive 1

Show directives

$ hdfs cacheadmin -listDirectives -stats
Found 1 entry
 ID POOL       REPL EXPIRY  PATH                                       BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED
  1 testPool      1 never   /user/hadoop/warehouse/tpch_10.db/orders     1749195031             0             1             0

3 Monitor Hadoop

Once the Hadoop cluster is up and running, check the web UI of each component as described below:

Daemon                       Web Interface          Notes
NameNode                     http://nn_host:port/   Default HTTP port is 9870
ResourceManager              http://rm_host:port/   Default HTTP port is 8088
MapReduce JobHistory Server  http://jhs_host:port/  Default HTTP port is 19888

You can also monitor the different services using command-line tools. For example, to list MapReduce jobs:

$ mapred job -list
2018-11-27 16:29:28,249 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-11-27 16:29:29,231 INFO conf.Configuration: resource-types.xml not found
2018-11-27 16:29:29,231 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
Total jobs:1
                  JobId	             JobName	     State	     StartTime	    UserName	       Queue	  Priority	 UsedContainers	 RsvdContainers	 UsedMem	 RsvdMem	 NeededMem	   AM info
 job_1543325089327_0083	with q17_part as (
 	   RUNNING	 1543332459617	      hadoop	     default	   DEFAULT	              4	              0	  14336M	      0M	    14336M	http://nuc10:8088/proxy/application_1543325089327_0083/

Some useful commands are:

  • mapred job -list
  • mapred job -kill [jobid]
  • mapred job -logs [jobid]
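
For YARN-level monitoring the yarn CLI offers similar commands, for example:

$ yarn application -list
$ yarn logs -applicationId <application_id>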

4 Setup Yarn Resource Management

The fundamental idea of MRv2 (YARN) is to split up the two major functionalities, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

The ResourceManager (RM) and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

Let's see how to configure YARN resource management.

4.1 Yarn memory

YARN can manage three system resources: memory, CPU and disks. You can tune these parameters in yarn-site.xml.

Property                               Default  Tuned
yarn.nodemanager.resource.memory-mb    .        16000
yarn.nodemanager.resource.cpu-vcores   .        6
yarn.nodemanager.resource.io-spindles  .        .

To view the available resources on each node, go to the RM UI (http://<IP_of_RM>:8088/cluster/nodes) and check the "Mem Avail", "Vcores Avail" and "Disk Avail" values for each node.
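
From the command line you can get a similar summary with the yarn CLI, for example:

$ yarn node -list -all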

4.2 Minimum and maximum allocation unit in YARN

Two resources, memory and CPU, have minimum and maximum allocation units in YARN, as set by the configurations below in yarn-site.xml.

Property                                  Default  Tuned
yarn.scheduler.minimum-allocation-mb      1024     .
yarn.scheduler.maximum-allocation-mb      8192     .
yarn.scheduler.minimum-allocation-vcores  1        .
yarn.scheduler.maximum-allocation-vcores  32       .

Basically, it means RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb" and not exceed "yarn.scheduler.maximum-allocation-mb";

And it can only allocate CPU vcores to containers in increments of "yarn.scheduler.minimum-allocation-vcores" and not exceed "yarn.scheduler.maximum-allocation-vcores".

If changes are required, set the above configurations in yarn-site.xml on the RM nodes and restart the RM.

For example, if one job asks for 1025 MB of memory per map container (set mapreduce.map.memory.mb=1025), RM will give it one 2048 MB (2 * yarn.scheduler.minimum-allocation-mb) container.

4.3 Virtual/physical memory checker

The NodeManager can monitor the memory usage (virtual and physical) of each container. If a container's virtual memory exceeds "yarn.nodemanager.vmem-pmem-ratio" times "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", the container will be killed if "yarn.nodemanager.vmem-check-enabled" is true;

If its physical memory exceeds "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", the container will be killed if “yarn.nodemanager.pmem-check-enabled” is true.

The parameters below can be set in yarn-site.xml on each NM node to override the default behavior.

Property                             Default  Tuned
yarn.nodemanager.vmem-check-enabled  false    .
yarn.nodemanager.pmem-check-enabled  true     .
yarn.nodemanager.vmem-pmem-ratio     2.1      .

4.4 Mapper, Reducer and AM resource requests

A MapReduce v2 job has three different container types: Mapper, Reducer and AM (Application Master).

Mapper and Reducer can ask for memory, CPU and disk, while the AM can only ask for memory and CPU.

Below is a summary of the resource request configurations for the three container types.

The default values can be overridden in mapred-site.xml on the client node, or set in applications such as MapReduce Java code, Pig and the Hive CLI.

Job type  Property                                   Default    Tuned
Mapper    mapreduce.map.memory.mb                    1024       2048
          mapreduce.map.java.opts                    -Xmx900m   .
          mapreduce.map.cpu.vcores                   1          .
          mapreduce.map.disk                         0.5        .
Reducer   mapreduce.reduce.memory.mb                 3072       4096
          mapreduce.reduce.java.opts                 -Xmx2560m  .
          mapreduce.reduce.cpu.vcores                1          .
          mapreduce.reduce.disk                      1.33       .
AM        yarn.app.mapreduce.am.resource.mb          1536       .
          yarn.app.mapreduce.am.command-opts         -Xmx1024m  .
          yarn.app.mapreduce.am.resource.cpu-vcores  1          .

Each container is actually a JVM process, and the "-Xmx" value in the java-opts above should fit within the allocated memory size. One best practice is to set it to 0.8 * (container memory allocation). For example, if the requested mapper container has mapreduce.map.memory.mb=4096, we can set mapreduce.map.java.opts=-Xmx3277m.
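
Since these are client-side settings, one way to apply the tuned values above for a Hive session is to pass them when launching the CLI; a sketch only, with the -Xmx figures following the 0.8 rule:

$ hive --hiveconf mapreduce.map.memory.mb=2048 --hiveconf mapreduce.map.java.opts=-Xmx1638m \
       --hiveconf mapreduce.reduce.memory.mb=4096 --hiveconf mapreduce.reduce.java.opts=-Xmx3277m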

There are many factors which can affect the memory requirement of each container, such as the number of Mappers/Reducers, the file type (plain text, Parquet, ORC), the data compression algorithm, the type of operations (sort, group-by, aggregation, join), data skew, etc. You should be familiar with the nature of the MapReduce job and figure out the minimum requirements for Mapper, Reducer and AM. Any container type can run out of memory and be killed by the physical/virtual memory checker if it doesn't meet the minimum memory requirement. If so, you need to check the AM log and the failed container log to find out the cause.

For example, if the MapReduce job sorts Parquet files, the Mapper needs to cache the whole Parquet row group in memory. I have done tests showing that the larger the row group size of the Parquet files, the more Mapper memory is needed. In this case, make sure the Mapper memory is large enough to avoid triggering OOM.

Another example is the AM running out of memory. Normally, the AM's 1 GB Java heap size is enough for many jobs. However, if the job writes lots of Parquet files, then during the commit phase the AM will call ParquetOutputCommitter.commitJob(), which first reads the footers of all output Parquet files and then writes a metadata file named "_metadata" in the output directory.

You can read more about how YARN memory should be configured here

5 Test Hadoop

The Hadoop installation contains a number of sample MapReduce jobs. We will run one of them to verify that our Hadoop installation is working fine.

  1. Log in as hadoop.
  2. We will first generate a file on the local system and copy it to the HDFS home folder.
    cat > /tmp/words.txt
    hello
    bye
    mark
    mary
    mark
    alfred
    ^D
  3. Copy the file to your HDFS home folder.
    $ hdfs dfs -copyFromLocal /tmp/words.txt .
  4. Let us run a MapReduce program on this HDFS file words.txt to find the number of occurrences of the word "mark" in the file. The results will be placed in the HDFS folder output.
    A suitable MapReduce example program (grep) ships with the Hadoop samples.
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep words.txt output 'mark'
    18/02/10 00:01:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    18/02/10 00:01:09 INFO input.FileInputFormat: Total input files to process : 1
    18/02/10 00:01:09 INFO mapreduce.JobSubmitter: number of splits:1
    18/02/10 00:01:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1518216494723_0003
    18/02/10 00:01:10 INFO impl.YarnClientImpl: Submitted application application_1518216494723_0003
    18/02/10 00:01:10 INFO mapreduce.Job: The url to track the job: http://iMac-test.local:8088/proxy/application_1518216494723_0003/
    18/02/10 00:01:10 INFO mapreduce.Job: Running job: job_1518216494723_0003
    18/02/10 00:01:19 INFO mapreduce.Job: Job job_1518216494723_0003 running in uber mode : false
    18/02/10 00:01:19 INFO mapreduce.Job:  map 0% reduce 0%
    18/02/10 00:01:25 INFO mapreduce.Job:  map 100% reduce 0%
    ...

    This runs the MapReduce job on the HDFS file uploaded earlier and writes the results to the output folder inside the HDFS home folder. The result file is named part-r-00000. It can be downloaded from the NameNode browser console, or you can run the following command to copy it to a local folder.

  5. Now, change to a working directory and download the HDFS output folder to look at the results.
    $ cd /tmp
    $ hdfs dfs -get output/* .
    $ cd output
    $ ls -l
    total 8
    -rw-r--r--  1 deister  wheel  0 10 feb 00:14 _SUCCESS
    -rw-r--r--  1 deister  wheel  7 10 feb 00:14 part-r-00000
  6. Finally, check the results of the job, which counted the occurrences of the word "mark":
    $ cat part-r-00000
    2	mark

6 Install Hive

To read more about Hive SQL click here

6.1 Download and install

  1. Log in as root and go to the /home directory. Search for a Hive mirror and download the release.
    # wget http://apache.rediris.es/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
  2. Extract Hive archive
    # tar xvfz apache-hive-3.1.1-bin.tar.gz
    apache-hive-3.1.1-bin/LICENSE
    apache-hive-3.1.1-bin/RELEASE_NOTES.txt
    apache-hive-3.1.1-bin/NOTICE
    apache-hive-3.1.1-bin/binary-package-licenses/com.thoughtworks.paranamer-LICENSE
    apache-hive-3.1.1-bin/binary-package-licenses/org.codehaus.janino-LICENSE
    apache-hive-3.1.1-bin/binary-package-licenses/org.jamon.jamon-runtime-LICENSE
    apache-hive-3.1.1-bin/binary-package-licenses/org.mozilla.rhino-LICENSE
    apache-hive-3.1.1-bin/binary-package-licenses/org.jruby-LICENSE
    ...
  3. Add the hive user (but remove its home directory, because we will use the Apache Hive directory instead).
    # useradd hive    
    # rmdir hive
  4. Create a symbolic link.
    # ln -s /home/apache-hive-3.1.1-bin hive
  5. Change owner and group to hive
    # chown -R hive:hive /home/apache-hive-3.1.1-bin
  6. Log in as hive.
    # su - hive
  7. Create the .bashrc file
    $ cat > .bashrc
    # .bashrc
    
    # Source global definitions
    if [ -f /etc/bashrc ]; then
    	. /etc/bashrc
    fi
    ^D
  8. Create a .bash_profile adding the path to the Hadoop and Hive binaries.
    $ cat > .bash_profile
    
    # .bash_profile
    
    # Get the aliases and functions
    if [ -f ~/.bashrc ]; then
    	. ~/.bashrc
    fi

6.2 Run hive from hadoop account

Now, we will test that Hive is correctly installed.

  1. Log in as hadoop and set up the path to Hive. Edit the .bash_profile and add:
    export PATH="/home/hive/bin:${PATH}"
  2. Create the metastore DB. This creates a metastore_db directory that contains the Hive metadata.
    $  schematool -initSchema -dbType derby
  3. Ensure the version of guava is the same for Hive and Hadoop.
    1. Go to $HIVE_HOME (%HIVE_HOME%)/lib folder and find out the version of guava. For Hive 3.0.0, it is guava-19.0.jar.
    2. Go to $HADOOP_HOME (%HADOOP_HOME%)/share/hadoop/common/lib folder and find out the version of guava. For Hadoop 3.2.1, the version is guava-27.0-jre.jar.
    3. If they are not the same (which is true in this case), delete the older version and copy the newer version to both. In this case, delete guava-19.0.jar in the Hive lib folder, and then copy guava-27.0-jre.jar from the Hadoop folder to Hive.
  4. Now you are ready to run Hive.
    $ hive
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/home/hadoop-3.0.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Hive Session ID = 65408e69-cd89-469f-bcca-410bc8447b80
  5. Run the command show databases.
    hive> show databases;
    OK
    default
    Time taken: 0.592 seconds, Fetched: 1 row(s)
    hive>
We have successfully connected to Hive. Now we need to configure it.

6.3 Debug Hive

You can start Hive with the debug option:

$ hive --hiveconf hive.execution.engine=mr  --hiveconf hive.root.logger=DEBUG,console

7 Setup Hive

Log in as hadoop (we will run Hive from the hadoop account to avoid having to set HDFS user permissions for the hive account).

7.1 Hive metastore

Configuring metastore means specifying to Hive where the database is stored.

All Hive implementations need a metastore service, where Hive stores metadata. It is implemented using tables in a relational database. By default, Hive uses the built-in Derby SQL server, which provides single-process storage, so while using Derby we cannot run more than one instance of the Hive CLI. That is fine when running Hive on a personal machine or for developer tasks, but to use it on a cluster MySQL or another similar relational database is required.

When you run a Hive query using the default Derby database, you will find that your current directory now contains a new sub-directory, metastore_db. The metastore is created there if it doesn't already exist.

The property of interest here is javax.jdo.option.ConnectionURL. The default value of this property is
jdbc:derby:;databaseName=metastore_db;create=true.

This value specifies that you will be using embedded derby as your Hive metastore and the location of the metastore is metastore_db.

7.1.1 Config hive-site.xml

  • Copy the hive-default.xml template as hive-site.xml.
    $ cp conf/hive-default.xml.template conf/hive-site.xml
  • Set the default properties for tmpdir and user name at the top of $HOME/hive/conf/hive-site.xml.
    <property>
        <name>system:java.io.tmpdir</name>
        <value>/tmp/${user.name}/java</value>
        </property>
    <property>
        <name>system:user.name</name>
        <value>${user.name}</value>
    </property>
    If you fail to set these values you will get an exception when running Hive:
    java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D    
    
  • We can also configure the directory where Hive stores table information. By default, the warehouse location is /user/${user.name}/warehouse, as specified in hive-site.xml.
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/${user.name}/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
  • Notice this location points to HDFS, so it must exist before you create any database. Create the warehouse directory for Hive in HDFS:
    $ hadoop fs -mkdir /user/hadoop/warehouse
    $ hdfs dfs -ls -R /
    drwxr-xr-x   - hadoop supergroup          0 2018-11-22 18:50 /user
    drwxr-xr-x   - hadoop supergroup          0 2018-11-22 18:50 /user/hadoop
    drwxr-xr-x   - hadoop supergroup          0 2018-11-22 18:50 /user/hadoop/warehouse

7.2 Tuning Hive

Hive automatically determines the number of reducers based on the following formula:

$$reducers = \frac{\text{bytes of input to mappers}}{\text{hive.exec.reducers.bytes.per.reducer}}$$

You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.

If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
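
For example, with 10 GB of input to the mappers and hive.exec.reducers.bytes.per.reducer set to 256 MB, the heuristic would request about 40 reducers. A minimal session sketch (the values are illustrative only, not recommendations):

hive> set hive.exec.reducers.bytes.per.reducer=268435456;
hive> set hive.exec.reducers.max=32;
hive> set mapred.reduce.tasks=16;

The first two settings drive the heuristic and cap it at 32 reducers; the last one forces exactly 16 and bypasses the heuristic entirely.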

8 Hive+Tez instead of MR

Tez is a DAG (directed acyclic graph) architecture. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Hive can run on Tez instead of MapReduce.

A typical MapReduce job has the following steps:

  1. Read data from file --> one disk access
  2. Run mappers
  3. Write map output --> second disk access
  4. Run shuffle and sort --> read map output, third disk access
  5. write shuffle and sort --> write sorted data for reducers --> fourth disk access
  6. Run reducers, which read the sorted data --> fifth disk access
  7. Write reducers output --> sixth disk access

Tez works very similarly to Spark (Tez was created by Hortonworks):

  1. Execute the plan but no need to read data from disk.
  2. Once ready to do some calculations (similar to actions in spark), get the data from disk and perform all steps and produce output.

Only one read and one write.

Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disk). On top of that there is vectorization (processing a batch of rows instead of one row at a time). All this adds up to efficiencies in query time.

You can read more about Tez configuration here
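
Once Tez is installed and configured, switching engines is a per-session Hive setting, so the same queries can be compared on both engines:

hive> set hive.execution.engine=tez;

Setting hive.execution.engine back to mr returns to classic MapReduce execution.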

8.1 Setup Tez

TO DO

This section is incomplete and will be concluded as soon as possible.

9 Load data

9.1 Generate TPCH files

For example, to generate scale factor 100, run:

export DSS_PATH=/data/TPCH/data100
mkdir $DSS_PATH
./dbgen -s 100 -f

9.2 Start hive

Start Hive and check that everything is working properly.

$ hive
hive >

9.3 Create database and tables

Execute hive and run the SQL commands below to create the TPCH database and tables. Hive supports several storage formats:

  • TEXTFILE: if your data is delimited by some parameters
  • ORCFILE: (Optimized row columnar) if you want to store your data in an optimized way which lessens your storage and increases your performance.
  • RCFILE: (Record Columnar File) if you want to perform analytics on your data and you want to store your data efficiently.
  • SEQUENCEFILE: if your data is in small files whose size is less than the block size.

DROP DATABASE IF EXISTS tpch_100 CASCADE;
CREATE DATABASE tpch_100;
USE tpch_100;

CREATE TABLE IF NOT EXISTS nation (
    n_nationkey integer ,
    n_name char(25) ,
    n_regionkey integer ,
    n_comment varchar(152) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS region (
    r_regionkey integer ,
    r_name char(25) ,
    r_comment varchar(152) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS part (
    p_partkey integer  ,
    p_name varchar(55)  ,
    p_mfgr char(25)  ,
    p_brand char(10)  ,
    p_type varchar(25)  ,
    p_size integer  ,
    p_container char(10) ,
    p_retailprice decimal(15,2) ,
    p_comment varchar(23) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS supplier (
    s_suppkey integer  ,
    s_name char(25)  ,
    s_address varchar(40) ,
    s_nationkey integer  ,
    s_phone char(15) ,
    s_acctbal decimal(15,2) ,
    s_comment varchar(101) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS partsupp (
    ps_partkey integer ,
    ps_suppkey integer ,
    ps_availqty integer ,
    ps_supplycost decimal(15,2) ,
    ps_comment varchar(199) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS customer (
    c_custkey integer  ,
    c_name varchar(25)  ,
    c_address varchar(40) ,
    c_nationkey integer  ,
    c_phone char(15)  ,
    c_acctbal decimal(15,2)  ,
    c_mktsegment char(10)  ,
    c_comment varchar(117) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS orders (
    o_orderkey integer ,
    o_custkey integer ,
    o_orderstatus char(1) ,
    o_totalprice decimal(15,2) ,
    o_orderdate date  ,
    o_orderpriority char(15)  ,
    o_clerk char(15)  ,
    o_shippriority integer  ,
    o_comment varchar(79)  
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS lineitem (
    l_orderkey integer  ,
    l_partkey integer  ,
    l_suppkey integer  ,
    l_linenumber integer  ,
    l_quantity decimal(15,2)  ,
    l_extendedprice decimal(15,2)  ,
    l_discount decimal(15,2)  ,
    l_tax decimal(15,2)  ,
    l_returnflag char(1)  ,
    l_linestatus char(1)  ,
    l_shipdate date  ,
    l_commitdate date  ,
    l_receiptdate date  ,
    l_shipinstruct char(25)  ,
    l_shipmode char(10)  ,
    l_comment varchar(44) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
The text-format tables above can be loaded directly from the dbgen output (see section 9.4). Alternatively, the following script builds the same schema in a columnar, partitioned and clustered layout, reading from an existing text-format database; it assumes two substitution variables, ${SOURCE} (the source database, e.g. tpch_100) and ${FILE} (the storage format, e.g. ORC).

DROP DATABASE IF EXISTS tpch_100 CASCADE;
CREATE DATABASE tpch_100;
USE tpch_100;


create table region
stored as ${FILE}
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB')
as select distinct * from ${SOURCE}.region;


create table nation
stored as ${FILE}
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB')
as select distinct * from ${SOURCE}.nation;

create table customer
stored as ${FILE}
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB')
as select * from ${SOURCE}.customer
cluster by C_MKTSEGMENT;

create table supplier
stored as ${FILE}
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB')
as select * from ${SOURCE}.supplier
cluster by s_nationkey, s_suppkey;

create table part
stored as ${FILE}
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB')
as select * from ${SOURCE}.part
cluster by p_brand;


create table partsupp
stored as ${FILE}
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB')
as select * from ${SOURCE}.partsupp
cluster by PS_SUPPKEY;


create table orders (O_ORDERKEY BIGINT,
 O_CUSTKEY BIGINT,
 O_ORDERSTATUS STRING,
 O_TOTALPRICE DOUBLE,
 O_ORDERPRIORITY STRING,
 O_CLERK STRING,
 O_SHIPPRIORITY INT,
 O_COMMENT STRING)
 partitioned by (O_ORDERDATE STRING)
stored as ${FILE}
;

ALTER TABLE orders SET TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB');

INSERT OVERWRITE TABLE orders partition(O_ORDERDATE)
select 
O_ORDERKEY ,
 O_CUSTKEY ,
 O_ORDERSTATUS ,
 O_TOTALPRICE ,
 O_ORDERPRIORITY ,
 O_CLERK ,
 O_SHIPPRIORITY ,
 O_COMMENT,
 O_ORDERDATE
  from ${SOURCE}.orders
;

create table lineitem 
(L_ORDERKEY BIGINT,
 L_PARTKEY BIGINT,
 L_SUPPKEY BIGINT,
 L_LINENUMBER INT,
 L_QUANTITY DOUBLE,
 L_EXTENDEDPRICE DOUBLE,
 L_DISCOUNT DOUBLE,
 L_TAX DOUBLE,
 L_RETURNFLAG STRING,
 L_LINESTATUS STRING,
 L_COMMITDATE STRING,
 L_RECEIPTDATE STRING,
 L_SHIPINSTRUCT STRING,
 L_SHIPMODE STRING,
 L_COMMENT STRING)
 partitioned by (L_SHIPDATE STRING)
stored as ${FILE}
;

ALTER TABLE lineitem SET TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB');

INSERT OVERWRITE TABLE lineitem Partition(L_SHIPDATE)
select 
L_ORDERKEY ,
 L_PARTKEY ,
 L_SUPPKEY ,
 L_LINENUMBER ,
 L_QUANTITY ,
 L_EXTENDEDPRICE ,
 L_DISCOUNT ,
 L_TAX ,
 L_RETURNFLAG ,
 L_LINESTATUS ,
 L_COMMITDATE ,
 L_RECEIPTDATE ,
 L_SHIPINSTRUCT ,
 L_SHIPMODE ,
 L_COMMENT ,
 L_SHIPDATE
 from ${SOURCE}.lineitem
;


analyze table nation compute statistics for columns;
analyze table region compute statistics for columns;
analyze table supplier compute statistics for columns;
analyze table part compute statistics for columns;
analyze table partsupp compute statistics for columns;
analyze table customer compute statistics for columns;
analyze table orders compute statistics for columns;
analyze table lineitem compute statistics for columns;
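
A hedged sketch of how this parameterized script might be invoked (the file name tpch_orc.sql is hypothetical; depending on your Hive version you may need to reference the variables as ${hivevar:SOURCE} and ${hivevar:FILE}):

$ hive --hivevar SOURCE=tpch_100 --hivevar FILE=ORC -f tpch_orc.sql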

9.4 Load data

From the hive shell, run the following commands to load the generated TPCH 100 files:

LOAD DATA LOCAL INPATH '/data/TPCH/data100/nation.tbl' OVERWRITE INTO TABLE nation;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/region.tbl' OVERWRITE INTO TABLE region;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/supplier.tbl' OVERWRITE INTO TABLE supplier;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/customer.tbl' OVERWRITE INTO TABLE customer;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/part.tbl' OVERWRITE INTO TABLE part;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/partsupp.tbl' OVERWRITE INTO TABLE partsupp;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/orders.tbl' OVERWRITE INTO TABLE orders;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/lineitem.tbl' OVERWRITE INTO TABLE lineitem;
Table      SF=100                        SF=1000
           Rows    Size (GB)  Time       Rows   Size (GB)  Time
region     1       0          00:00      .      0          .
nation     1       0          00:00      .      0          .
supplier   1M      0.14       00:02      .      0          .
customer   15M     2.3        00:03      .      0          .
part       20M     2.6        00:01      .      0          .
partsupp   80M     11         00:13      .      0          .
orders     150M    15         00:25      .      0          .
lineitem   600M    64         01:08      .      0          .
Total table load time: 00:01:52 (SF=100), ? (SF=1000)

9.5 Using ORC storage

To use ORC storage, first create a table to load the delimited file, then copy the data into an ORC-format table.

CREATE TABLE IF NOT EXISTS lineitem_ext (
    l_orderkey integer  ,
    l_partkey integer  ,
    l_suppkey integer  ,
    l_linenumber integer  ,
    l_quantity decimal(15,2)  ,
    l_extendedprice decimal(15,2)  ,
    l_discount decimal(15,2)  ,
    l_tax decimal(15,2)  ,
    l_returnflag char(1)  ,
    l_linestatus char(1)  ,
    l_shipdate date  ,
    l_commitdate date  ,
    l_receiptdate date  ,
    l_shipinstruct char(25)  ,
    l_shipmode char(10)  ,
    l_comment varchar(44) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

LOAD DATA LOCAL INPATH '/data/TPCH/data100/lineitem.tbl' OVERWRITE INTO TABLE lineitem_ext;

create table if not exists lineitem 
(L_ORDERKEY BIGINT,
 L_PARTKEY BIGINT,
 L_SUPPKEY BIGINT,
 L_LINENUMBER INT,
 L_QUANTITY DOUBLE,
 L_EXTENDEDPRICE DOUBLE,
 L_DISCOUNT DOUBLE,
 L_TAX DOUBLE,
 L_RETURNFLAG STRING,
 L_LINESTATUS STRING,
 L_SHIPDATE STRING,
 L_COMMITDATE STRING,
 L_RECEIPTDATE STRING,
 L_SHIPINSTRUCT STRING,
 L_SHIPMODE STRING,
 L_COMMENT STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
;
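
Finally, copy the data from the staging table into the ORC table. A minimal sketch, assuming the column order of lineitem_ext matches the ORC table definition above and relying on Hive's implicit type conversions:

INSERT OVERWRITE TABLE lineitem
SELECT * FROM lineitem_ext;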

9.6 Creating statistics

Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them.

analyze table region compute statistics;
analyze table nation compute statistics;
analyze table supplier compute statistics;
analyze table customer compute statistics;
analyze table part compute statistics;
analyze table partsupp compute statistics;
analyze table orders compute statistics;
analyze table lineitem compute statistics;

10 Running queries

Now you can run Hive SQL queries using either MR or Tez.
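
While this section is being completed, a quick sanity check can be run against one of the loaded databases (an illustrative aggregate, not one of the official TPCH queries):

$ hive --database tpch_100 -e "select l_returnflag, count(*) from lineitem group by l_returnflag;"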

TO DO

This section is incomplete and will be concluded as soon as possible.

11 Hadoop cluster

In etc/hadoop/workers, set the worker (slave) nodes.

Copy the contents of etc/hadoop from the master to the rest of the nodes, as sketched below.
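
A minimal sketch, assuming the same installation path on every node and passwordless SSH from the master:

$ for h in $(cat /home/hadoop/etc/hadoop/workers); do
    rsync -a /home/hadoop/etc/hadoop/ ${h}:/home/hadoop/etc/hadoop/
  done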

TO DO

This section is incomplete and will be concluded as soon as possible.