This document describes a test of the TPCH benchmark at low scales (10, 20, 50 and 100) using Hadoop on a single NUC computer, plus a final test at scale 1000 on a cluster of 10 NUC computers (10 times scale 100).
You can read more about the Hadoop TPCH test in the related documents.
1 Install Hadoop
1.1 Install OpenJDK 8 JDK
To install OpenJDK 8 JDK using yum, run this command as root:
# yum install java-1.8.0-openjdk-devel
After the installation completes, check the Java version:
# java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
1.2 Download Hadoop release
With Java in place, visit the Apache Hadoop Releases page to find the most recent stable release. Click on the binary link for the desired release and you will be redirected to the download page. Alternatively, you can use wget on a given release URL to download the archive directly.
# cd /home
# wget http://apache.uvigo.es/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
--2018-11-22 14:22:13-- http://apache.uvigo.es/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
Resolving apache.uvigo.es (apache.uvigo.es)... 193.146.32.74, 2001:720:1214:4200::74
Connecting to apache.uvigo.es (apache.uvigo.es)|193.146.32.74|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 314322972 (300M) [application/x-gzip]
Saving to: ‘hadoop-3.0.3.tar.gz’
100%[=======================================================================>] 314,322,972 7.46MB/s in 41s
2018-11-22 14:22:54 (7.38 MB/s) - ‘hadoop-3.0.3.tar.gz’ saved [314322972/314322972]
1.3 Extract files from archive
Extract Hadoop from the archive. This will create a directory named after the Hadoop version number.
# cd /home
# tar xvzf hadoop-3.0.3.tar.gz
hadoop-3.0.3/
hadoop-3.0.3/LICENSE.txt
hadoop-3.0.3/NOTICE.txt
hadoop-3.0.3/README.txt
hadoop-3.0.3/bin/
hadoop-3.0.3/bin/hadoop
...
1.4 Create a hadoop user account
As root, create the hadoop user account.
# useradd hadoop
Assuming we have extracted Hadoop in /home, we can create a link for the hadoop user as:
# cd /home
# ln -s /home/hadoop-3.0.3 /home/hadoop
Now, ensure all files are owned by the hadoop user.
# cd /home
# chown -R hadoop:hadoop hadoop-3.0.3
1.5 Login as hadoop
Ensure everything is correct by logging in as hadoop.
# su - hadoop
hadoop@nuc10 $ ls -l
total 176
drwxr-xr-x 2 hadoop hadoop 183 May 31 19:36 bin
drwxr-xr-x 3 hadoop hadoop 20 May 31 19:13 etc
drwxr-xr-x 2 hadoop hadoop 106 May 31 19:36 include
drwxr-xr-x 3 hadoop hadoop 20 May 31 19:36 lib
drwxr-xr-x 4 hadoop hadoop 288 May 31 19:36 libexec
-rw-rw-r-- 1 hadoop hadoop 147066 May 29 19:58 LICENSE.txt
-rw-rw-r-- 1 hadoop hadoop 20891 May 29 19:58 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop 1366 May 26 20:43 README.txt
drwxr-xr-x 3 hadoop hadoop 4096 May 31 19:13 sbin
drwxr-xr-x 4 hadoop hadoop 31 May 31 19:50 share
Create a .bash_profile in /home/hadoop to set up the bash environment if needed.
$ cat > .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi
^D
1.6 Configuring Hadoop's Java Home
Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.
The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then we'll use sed to trim bin/java from the output to give us the correct value for JAVA_HOME.
To find the default Java path:
$ readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre/
You can copy this output to set Hadoop's Java home to this specific version, which ensures that if the default Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so that Hadoop will automatically use whatever Java version is set as the system default.
To begin, open hadoop-env.sh:
$ vi /home/hadoop/etc/hadoop/hadoop-env.sh
1.6.1 Option 1: Set a Static Value
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-0.el7_5.x86_64/jre/
1.6.2 Option 2: Use Readlink to Set the Value Dynamically
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
1.7 Verify Hadoop installation
Now we should be able to run Hadoop. Login as hadoop and run:
$ bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
where CLASSNAME is a user-provided Java class
OPTIONS is none or any of:
...
Seeing this help output means we've successfully configured Hadoop to run in stand-alone mode.
2 Setup Hadoop
Now, as we have Hadoop properly installed we will proceed to configure it.
2.1 Configure ssh
You need to enable SSH on the Hadoop machine. Ensure SSH is enabled:
$ sudo systemsetup -getremotelogin
Remote Login: On
When Hadoop is installed in distributed mode, it uses passwordless SSH for master-to-slave communication. To enable the SSH daemon on a Mac, go to System Preferences => Sharing and click on Remote Login. Execute the following commands on the terminal to enable passwordless SSH login:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
Now you can run a simple command without entering a password:
$ ssh localhost ls
Applications
Desktop
Documents
Downloads
...
2.2 Configure Hadoop
Modify the various Hadoop configuration files to properly set up Hadoop and YARN. These files are located under the etc/hadoop directory.
2.2.1 Shell environment variables
.bash_profile
export HADOOP_HOME=/home/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
2.2.2 Hadoop configuration files
In core-site.xml, configure the master node.
etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node:9000</value>
  </property>
</configuration>
For example:
etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nuc10:9000</value>
  </property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME</value>
  </property>
  <!-- required to run mapreduce on master node -->
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
  </property>
</configuration>
In yarn-site.xml, set yarn.resourcemanager.hostname to the master node.
etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>85.0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>nuc10</value>
  </property>
</configuration>
Note the use of the disk utilization threshold above. This tells YARN to continue operations while disk utilization stays below 85.0%; the default value is 90%. If disk utilization goes above the configured threshold, YARN will report the node as unhealthy with the error "local-dirs are bad".
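Once the cluster is up (see section 2.6), node health can also be checked from the command line; a quick sketch using the standard YARN CLI:
$ yarn node -list -all
$ yarn node -list -states UNHEALTHY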
2.3 Change location of hadoop data
To change the storage location to an appropriate filesystem, edit the file etc/hadoop/hdfs-site.xml and set the dfs.name.dir and dfs.data.dir properties accordingly.
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
2.4 Configure Slaves
Execute the following commands on the master-node (nuc10).
The workers file is used by the startup scripts to start the required daemons on all nodes. Edit etc/hadoop/workers to be:
etc/hadoop/workers
nuc00
nuc01
nuc02
nuc03
nuc04
nuc05
nuc06
nuc07
nuc08
nuc00
2.5 Initialize Hadoop Cluster
Login as hadoop:
$ ls -l
total 176
drwxr-xr-x 2 hadoop hadoop 183 May 31 19:36 bin
drwxr-xr-x 3 hadoop hadoop 20 May 31 19:13 etc
drwxr-xr-x 2 hadoop hadoop 106 May 31 19:36 include
drwxr-xr-x 3 hadoop hadoop 20 May 31 19:36 lib
drwxr-xr-x 4 hadoop hadoop 288 May 31 19:36 libexec
-rw-rw-r-- 1 hadoop hadoop 147066 May 29 19:58 LICENSE.txt
drwxrwxr-x 2 hadoop hadoop 39 Nov 22 16:47 logs
-rw-rw-r-- 1 hadoop hadoop 20891 May 29 19:58 NOTICE.txt
-rw-r--r-- 1 hadoop hadoop 1366 May 26 20:43 README.txt
drwxr-xr-x 3 hadoop hadoop 4096 May 31 19:13 sbin
drwxr-xr-x 4 hadoop hadoop 31 May 31 19:50 share
Run the following command to initialize the metadata for the Hadoop cluster. This formats the HDFS file system and configures it on the local system. By default the name node metadata is stored in the /tmp/hadoop-<username> folder. It is possible to modify the default name node location by configuring the hdfs-site.xml file; similarly, the HDFS data block storage location can be changed using the dfs.data.dir property.
Copy the content of etc/hadoop from the master to the rest of the nodes.
$ hdfs namenode -format
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = nuc10/192.168.9.180
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.0.3
STARTUP_MSG: classpath = /home/hadoop-3.0.3/etc/hadoop:/....
...
...
2018-11-22 16:47:03,930 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
2018-11-22 16:47:03,937 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2018-11-22 16:47:03,991 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds .
2018-11-22 16:47:03,999 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-11-22 16:47:04,002 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at nuc10/192.168.9.180
************************************************************/
2.6 Start Hadoop Cluster
To configure passwordless SSH, you need to add the client machine's public key to the server machine's ~/.ssh/authorized_keys file. In this case, both systems are the same machine. First of all, logged in as the hadoop user, generate a private and public key pair:
$ su - hadoop
$ ssh-keygen
Finally, copy the public key to the authorized_keys file:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Check that the file permissions are "-rw-------"; if not, execute:
$ chmod 600 ~/.ssh/authorized_keys
Run the following command from terminal (after switching to hadoop home folder) to start the Hadoop cluster. This starts name node and data node on the local system.
$ sbin/start-all.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [nuc10]
Starting resourcemanager
Starting nodemanagers
To verify that the namenode and datanode daemons are running, execute the following command on the terminal. This displays running Java processes on the system.
$ jps
7897 DataNode
1922 Jps
726 NameNode
1400 ResourceManager
1115 SecondaryNameNode
1611 NodeManager
We should see both a DataNode and a NameNode on the same server because we are deploying a single-node Hadoop. When running on a cluster, the NameNode host will not normally run a DataNode.
2.6.1 Run YARN Manager
You can manually start YARN resource manager and node manager instances by running the following command on the terminal:
$ sbin/start-yarn.sh
Run jps command again to verify all the running processes,
19203 DataNode
29283 Jps
19413 ResourceManager
19126 NameNode
19303 SecondaryNameNode
19497 NodeManager
The presence of ResourceManager signals that YARN is running. This command is not needed if you start Hadoop using start-all.sh.
2.7 Start history server
Additionally, to start the history server run:
$ bin/mapred --daemon start historyserver
You will see a "JobHistoryServer" process listed by jps. Now you can access the history server console, which by default is started on port 19888.
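As a quick check (a sketch; nuc10 and 19888 are the host and default port used in this setup, and /ws/v1/history/info is the standard JobHistory REST endpoint):
$ jps | grep JobHistoryServer
$ curl -s http://nuc10:19888/ws/v1/history/info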
2.8 Stop Hadoop cluster
Simply run:
$ sbin/stop-all.sh
2.9 Working with HDFS
You cannot directly browse HDFS from terminal using cat or similar commands. HDFS is a logical file system and does not directly map to Unix file system. You should have an HDFS client and your Hadoop cluster should be running. When you browse HDFS, you are getting your directory structure from namenode and actual data from datanodes.
Although you cannot browse it directly, the data is there, stored by the DataNode daemon. By default HDFS uses the values specified in hdfs-default.xml; the values used in this setup are shown below.
Name | Value | Description |
---|---|---|
dfs.name.dir | /data/hadoop/hdfs/name | Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
dfs.data.dir | /data/hadoop/hdfs/data | Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. |
dfs.replication | 3 | Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time. |
2.9.1 Configure HDFS Home Directories
We will now configure the HDFS home directories. The home directory is of the form /user/<username>, so you need to create two directories.
$ sbin/start-all.sh
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/`whoami`
2.9.2 Create a directory
$ hdfs dfs -mkdir test
2.9.3 List directories
List root directory
$ hdfs dfs -ls /
drwxr-xr-x - hadoop supergroup 0 2018-11-22 17:15 /user
List root directory recursive
$ hdfs dfs -ls -R /
drwxr-xr-x - hadoop supergroup 0 2018-11-22 17:15 /user
drwxr-xr-x - hadoop supergroup 0 2018-11-22 17:15 /user/hadoop
drwxr-xr-x - hadoop supergroup 0 2018-11-22 17:15 /user/hadoop/test
The following example shows a Hadoop filesystem storing various TPCH hive databases.
$ hdfs dfs -ls /user/hadoop/warehouse
Found 5 items
drwxr-xr-x - hadoop supergroup 0 2018-11-26 20:06 /user/hadoop/warehouse/tpch_1.db
drwxr-xr-x - hadoop supergroup 0 2018-11-27 16:42 /user/hadoop/warehouse/tpch_10.db
drwxr-xr-x - hadoop supergroup 0 2018-11-23 16:50 /user/hadoop/warehouse/tpch_100.db
drwxr-xr-x - hadoop supergroup 0 2018-11-26 20:19 /user/hadoop/warehouse/tpch_20.db
drwxr-xr-x - hadoop supergroup 0 2018-11-26 20:33 /user/hadoop/warehouse/tpch_50.db
2.9.4 Remove a directory
$ hdfs dfs -rm -r test
2.9.5 Copy a file
Now we can try to copy a file from our local directory to our HDFS home under /user/<username>:
$ echo "Hello World" > sample.txt
$ hdfs dfs -copyFromLocal sample.txt .
2.9.6 Cat a file
$ hdfs dfs -cat sample.txt
Hello World
2.9.7 fsck
$ hdfs fsck /
Connecting to namenode via http://localhost:9870/fsck?ugi=hadoop&path=%2F
FSCK started by hadoop (auth:SIMPLE) from /127.0.0.1 for path / at Thu Nov 22 17:17:10 CET 2018
Status: HEALTHY
Number of data-nodes: 1
Number of racks: 1
Total dirs: 3
Total symlinks: 0
Replicated Blocks:
Total size: 12 B
Total files: 1
Total blocks (validated): 1 (avg. block size 12 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
FSCK ended at Thu Nov 22 17:17:10 CET 2018 in 4 milliseconds
The filesystem under path '/' is HEALTHY
$ hdfs fsck / -files -blocks
....
2.9.8 Expunge
To reclaim disk space from the HDFS trash:
$ hdfs dfs -expunge
2.9.9 Disk usage
To see disk usage:
$ hdfs dfs -du -h
38.7 M 38.7 M .hiveJars
12 12 sample.txt
92.8 G 92.8 G warehouse
66 66 words.txt
2.9.10 Using HDFS cache
Centralized Cache Management in HDFS is a mechanism that explicitly caches specific files or directories in memory for improved performance. This is useful for relatively small files that are accessed repeatedly. For example, reference/lookup tables or fact tables that are used in many joins. Once enabled, HDFS will automatically cache selected files, and periodically check for changes and recache the files.
While HDFS and the underlying file system do some caching of files when memory is available, explicit caching using Centralized Cache Management prevents the data from being evicted from memory when processes consume all of the physical memory. As a corollary, if you are working on a lightly loaded system where there is free memory, you may not see any performance improvement from this method, as the data was already in the disk cache. So your performance testing needs to stress the system.
Let’s look at some key terms and concepts:
- Cache pools: A cache pool is an administrative entity used to manage groups of cache directives. One of the key attributes of the pool is the maximum number of bytes that can be cached for all directives in this pool.
- Cache directives: A cache directive defines a path that should be cached. This can be either a specific file or a single directory. Note that directives are not recursive: they apply to a single directory only, not to any sub-directories. So they would usually be applied to the lowest level directory that contains the actual data files.
- HDFS configuration settings: There is really only one Hadoop configuration setting that is required to turn on Centralized Caching. There are a few others to control the frequency at which caching looks for new files, which you can usually leave at default. The following setting, added to the custom hdfs-site.xml, specifies the maximum number of bytes that can be cached on each datanode:
dfs.datanode.max.locked.memory
Remember that this value is in bytes, in contrast with the OS limits which are set in KB.
- OS limits: Before you implement Centralized Caching, you need to ensure that the locked memory setting on each of the datanodes is set to a value equal to or greater than the memory specified in dfs.datanode.max.locked.memory.
Setting OS limits
On each datanode, run the following to determine the limit for locked memory. This will return a value in KB or “unlimited”.
ulimit -l
64
Set memlock limits on each datanode.
# On each datanode (max cacheable memory in KB), example for 4.0 GB
echo "* hard memlock 4194304" >> /etc/security/limits.conf
echo "* soft memlock 4194304" >> /etc/security/limits.conf
This will take effect after you log out and log in again.
ulimit -l
4194304
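The two units are easy to mix up: limits.conf takes KB while dfs.datanode.max.locked.memory (next step) takes bytes. A quick sanity check for the 4 GB example above:
$ echo $((4 * 1024 * 1024))        # limits.conf value in KB
4194304
$ echo $((4 * 1024 * 1024 * 1024)) # dfs.datanode.max.locked.memory value in bytes
4294967296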
Setup hdfs-site.xml
Edit hdfs-site.xml and set up the HDFS node cache. For example, to set up a 2 GB cache:
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>2147483648</value>
</property>
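The directive added below references a cache pool named testPool. If the pool does not exist yet, it can be created first; a minimal sketch (the 2 GB -limit value is just an example matching the configuration above):
$ hdfs cacheadmin -addPool testPool -limit 2147483648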
Add a cache directive
$ hdfs cacheadmin -addDirective -path /user/hadoop/warehouse/tpch_10.db/orders -pool testPool -ttl never
Added cache directive 1
Show directives
$ hdfs cacheadmin -listDirectives -stats
Found 1 entry
ID POOL REPL EXPIRY PATH BYTES_NEEDED BYTES_CACHED FILES_NEEDED FILES_CACHED
1 testPool 1 never /user/hadoop/warehouse/tpch_10.db/orders 1749195031 0 1 0
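To watch caching progress at the pool level, or to confirm that the datanodes are actually locking pages in memory, the following standard commands can be used (a sketch; the output depends on your cluster):
$ hdfs cacheadmin -listPools -stats
$ hdfs dfsadmin -report | grep -i cache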
3 Monitor Hadoop
Once the Hadoop cluster is up and running check the web-ui of the components as described below:
Daemon | Web Interface | Notes |
---|---|---|
NameNode | http://nn_host:port/ | Default HTTP port is 9870 |
ResourceManager | http://rm_host:port/ | Default HTTP port is 8088 |
MapReduce JobHistory Server | http://jhs_host:port/ | Default HTTP port is 19888 |
You can monitor different services from command line tools. For example, to monitor map reduce tasks.
$ mapred job -list
2018-11-27 16:29:28,249 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-11-27 16:29:29,231 INFO conf.Configuration: resource-types.xml not found
2018-11-27 16:29:29,231 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
Total jobs:1
JobId JobName State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1543325089327_0083 with q17_part as (
RUNNING 1543332459617 hadoop default DEFAULT 4 0 14336M 0M 14336M http://nuc10:8088/proxy/application_1543325089327_0083/
Some useful commands are:
- mapred job -list
- mapred job -kill [jobid]
- mapred job -logs [jobid]
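The same jobs can also be inspected through the YARN application CLI, using the application id shown in the listing above:
$ yarn application -list
$ yarn logs -applicationId application_1543325089327_0083
$ yarn application -kill application_1543325089327_0083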
4 Setup Yarn Resource Management
The fundamental idea of MRv2 (YARN) is to split up the two major functionalities, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
The ResourceManager(RM) and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
Let's see how to configure YARN resource management.
4.1 Yarn memory
YARN can manage three system resources: memory, CPU and disks. You can tune these parameters in yarn-site.xml.
Property | Default | Tuned |
---|---|---|
yarn.nodemanager.resource.memory-mb | . | 16000 |
yarn.nodemanager.resource.cpu-vcores | . | 6 |
yarn.nodemanager.resource.io-spindles | . | . |
To view the available resources on each node, you can go to the RM UI (http://<IP_of_RM>:8088/cluster/nodes) and check the “Mem Avail”, “Vcores Avail” and “Disk Avail” columns for each node.
4.2 Minimum and maximum allocation unit in YARN
Two resources, memory and CPU, have minimum and maximum allocation units in YARN, as set by the configurations below in yarn-site.xml.
Property | Default | Tuned |
---|---|---|
yarn.scheduler.minimum-allocation-mb | 1024 | . |
yarn.scheduler.maximum-allocation-mb | 8192 | . |
yarn.scheduler.minimum-allocation-vcores | 1 | . |
yarn.scheduler.maximum-allocation-vcores | 32 | . |
Basically, it means RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb" and not exceed "yarn.scheduler.maximum-allocation-mb";
And it can only allocate CPU vcores to containers in increments of "yarn.scheduler.minimum-allocation-vcores" and not exceed "yarn.scheduler.maximum-allocation-vcores".
If changes are required, set the above configurations in yarn-site.xml on the RM nodes and restart the RM.
For example, if a job asks for 1025 MB of memory per map container (mapreduce.map.memory.mb=1025), the RM will give it one 2048 MB (2 * yarn.scheduler.minimum-allocation-mb) container.
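The rounding rule itself is simple arithmetic; a minimal sketch in shell reproducing the 1025 MB example (request and increment are illustrative values):
$ request=1025; increment=1024
$ echo $(( (request + increment - 1) / increment * increment ))
2048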
4.3 Virtual/physical memory checker
The NodeManager can monitor the memory usage (virtual and physical) of each container. If its virtual memory exceeds “yarn.nodemanager.vmem-pmem-ratio” times "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", the container will be killed if “yarn.nodemanager.vmem-check-enabled” is true;
If its physical memory exceeds "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", the container will be killed if “yarn.nodemanager.pmem-check-enabled” is true.
The parameters below can be set in yarn-site.xml on each NM node to override the default behavior.
Property | Default | Tuned |
---|---|---|
yarn.nodemanager.vmem-check-enabled | false | . |
yarn.nodemanager.pmem-check-enabled | true | . |
yarn.nodemanager.vmem-pmem-ratio | 2.1 | . |
4.4 Mapper, Reducer and AM resource requests
A MapReduce v2 job has three different container types: Mapper, Reducer and AM (Application Master). Mappers and Reducers can ask for memory, CPU and disk resources, while the AM can only ask for memory and CPU.
Below is a summary of the resource request configurations for the three container types. The default values can be overridden in mapred-site.xml on the client node, or set in applications such as MapReduce Java code, Pig and the Hive CLI.
Job type | Property | Default | Tuned |
---|---|---|---|
Mapper | mapreduce.map.memory.mb | 1024 | 2048 |
Mapper | mapreduce.map.java.opts | -Xmx900m | . |
Mapper | mapreduce.map.cpu.vcores | 1 | . |
Mapper | mapreduce.map.disk | 0.5 | . |
Reducer | mapreduce.reduce.memory.mb | 3072 | 4096 |
Reducer | mapreduce.reduce.java.opts | -Xmx2560m | . |
Reducer | mapreduce.reduce.cpu.vcores | 1 | . |
Reducer | mapreduce.reduce.disk | 1.33 | . |
AM | yarn.app.mapreduce.am.resource.mb | 1536 | . |
AM | yarn.app.mapreduce.am.command-opts | -Xmx1024m | . |
AM | yarn.app.mapreduce.am.resource.cpu-vcores | 1 | . |
Each container is actually a JVM process, and the “-Xmx” value of the java-opts above should fit within the allocated memory size. One best practice is to set it to 0.8 * (container memory allocation). For example, if the requested mapper container has mapreduce.map.memory.mb=4096, we can set mapreduce.map.java.opts=-Xmx3277m.
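A minimal sketch of that 0.8 rule of thumb, computing the heap option from the container size in MB (4096 is the example above; integer arithmetic gives 3276, essentially the same as the 3277 quoted):
$ container_mb=4096
$ echo "-Xmx$(( container_mb * 8 / 10 ))m"
-Xmx3276m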
There are many factors which can affect the memory requirement of each container, such as the number of Mappers/Reducers, the file type (plain text, Parquet, ORC), the data compression algorithm, the type of operations (sort, group-by, aggregation, join), data skew, etc. You should be familiar with the nature of the MapReduce job and figure out the minimum requirements for the Mapper, Reducer and AM. Any container type can run out of memory and be killed by the physical/virtual memory checker if it doesn't meet the minimum memory requirement. If so, you need to check the AM log and the failed container log to find out the cause.
For example, if the MapReduce job sorts parquet files, Mapper needs to cache the whole Parquet row group in memory. I have done tests to prove that the larger the row group size of parquet files is, the larger Mapper memory is needed. In this case, make sure the Mapper memory is large enough without triggering OOM.
Another example is AM running out of memory. Normally, AM’s 1G java heap size is enough for many jobs. However, if the job is to write lots of parquet files, during commit phase of the job, AM will call ParquetOutputCommitter.commitJob(). It will first read footers of all output parquet files, and write the metadata file named “_metadata” in output directory.
You can read more about how YARN memory should be configured in the Hadoop and YARN documentation.
5 Test Hadoop
Hadoop installation contains a number of sample mapreduce jobs. We will run one of them to verify that our hadoop installation is working fine.
- Login as hadoop.
- We will first generate a file and copy it from the local system to the HDFS home folder:
$ cat > /tmp/words.txt
hello bye mark mary mark alfred
^D
- Copy the file to your HDFS home directory:
$ hdfs dfs -copyFromLocal /tmp/words.txt .
- Let us run a mapreduce program on this HDFS file words.txt to find the number of occurrences of the word "mark" in the file. The results will be placed in the HDFS folder output. A mapreduce program for word count is available in the Hadoop samples.
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep words.txt output 'mark'
18/02/10 00:01:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/10 00:01:09 INFO input.FileInputFormat: Total input files to process : 1
18/02/10 00:01:09 INFO mapreduce.JobSubmitter: number of splits:1
18/02/10 00:01:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1518216494723_0003
18/02/10 00:01:10 INFO impl.YarnClientImpl: Submitted application application_1518216494723_0003
18/02/10 00:01:10 INFO mapreduce.Job: The url to track the job: http://iMac-test.local:8088/proxy/application_1518216494723_0003/
18/02/10 00:01:10 INFO mapreduce.Job: Running job: job_1518216494723_0003
18/02/10 00:01:19 INFO mapreduce.Job: Job job_1518216494723_0003 running in uber mode : false
18/02/10 00:01:19 INFO mapreduce.Job: map 0% reduce 0%
18/02/10 00:01:25 INFO mapreduce.Job: map 100% reduce 0%
...
This runs the mapreduce job on the HDFS file uploaded earlier and then writes the results to the output folder inside the HDFS home folder. The result file will be named part-r-00000. It can be downloaded from the name node browser console, or you can run the following commands to copy it to a local folder.
- Now, change to a working directory and download the HDFS output folder to look into the results:
$ cd /tmp
$ hdfs dfs -get output/* .
$ cd output
$ ls -l
total 8
-rw-r--r-- 1 deister wheel 0 10 feb 00:14 _SUCCESS
-rw-r--r-- 1 deister wheel 7 10 feb 00:14 part-r-00000
- Finally, check the results of the word count job searching for the word mark:
$ cat part-r-00000
2 mark
6 Install Hive
To read more about Hive SQL, see the Hive documentation.
6.1 Download and install
- Login as root and go to the /home directory. Search for a Hive mirror and download the release:
# wget http://apache.rediris.es/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
- Extract the Hive archive:
# tar xvfz apache-hive-3.1.1-bin.tar.gz
apache-hive-3.1.1-bin/LICENSE
apache-hive-3.1.1-bin/RELEASE_NOTES.txt
apache-hive-3.1.1-bin/NOTICE
apache-hive-3.1.1-bin/binary-package-licenses/com.thoughtworks.paranamer-LICENSE
apache-hive-3.1.1-bin/binary-package-licenses/org.codehaus.janino-LICENSE
apache-hive-3.1.1-bin/binary-package-licenses/org.jamon.jamon-runtime-LICENSE
apache-hive-3.1.1-bin/binary-package-licenses/org.mozilla.rhino-LICENSE
apache-hive-3.1.1-bin/binary-package-licenses/org.jruby-LICENSE
...
- Add a hive user (but remove its home directory, because we will use the Apache Hive directory instead):
# useradd hive
# rmdir hive
- Create a symbolic link:
# ln -s /home/apache-hive-3.1.1-bin hive
- Change the owner and group to hive:
# chown -R hive:hive /home/apache-hive-3.1.1-bin
- Login as hive:
# su - hive
- Create the .bashrc file:
$ cat > .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
^D
- Create a .bash_profile adding the path to the Hadoop and Hive binaries:
$ cat > .bash_profile
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi
export PATH="/home/hadoop/bin:/home/hive/bin:${PATH}"
^D
6.2 Run hive from hadoop account
Now we will test that Hive is correctly installed.
- Login as hadoop and set up the path to Hive. Edit the .bash_profile and add:
export PATH="/home/hive/bin:${PATH}"
- Create the metastore database. This will create a metastore_db directory that contains the Hive metadata:
$ schematool -initSchema -dbType derby
- Ensure the version of guava is the same for Hive and Hadoop (a quick check is sketched after this list).
- Go to the $HIVE_HOME/lib folder and find out the version of guava. For Hive 3.0.0, it is guava-19.0.jar.
- Go to the $HADOOP_HOME/share/hadoop/common/lib folder and find out the version of guava. For Hadoop 3.2.1, the version is guava-27.0-jre.jar.
- If they are not the same (which is the case here), delete the older version and copy the newer version to both locations. In this case, delete guava-19.0.jar in the Hive lib folder, and then copy guava-27.0-jre.jar from the Hadoop folder to Hive.
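A quick way to compare the bundled guava versions on this layout (a sketch; the paths assume the /home/hadoop and /home/hive symlinks created earlier):
$ ls /home/hive/lib/guava-*.jar /home/hadoop/share/hadoop/common/lib/guava-*.jar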
- Now you are ready to run Hive:
$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop-3.0.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 65408e69-cd89-469f-bcca-410bc8447b80
- Run the command show databases:
hive> show databases;
OK
default
Time taken: 0.592 seconds, Fetched: 1 row(s)
hive>
6.3 Debug Hive
You can start Hive with the debug option:
$ hive --hiveconf hive.execution.engine=mr --hiveconf hive.root.logger=DEBUG,console
7 Setup Hive
Login as hadoop (we will run Hive from the hadoop account to avoid having to set HDFS user permissions for the hive account).
7.1 Hive metastore
Configuring metastore means specifying to Hive where the database is stored.
All Hive implementations need a metastore service, where Hive stores its metadata. It is implemented using tables in a relational database. By default, Hive uses the built-in Derby SQL server. Derby provides single-process storage, so when we use Derby we cannot run more than one instance of the Hive CLI. This is fine when running Hive on a personal machine or for developer tasks, but for a cluster MySQL or another similar relational database is required.
When you run a Hive query using the default Derby database, you will find that your current directory now contains a new sub-directory, metastore_db. The metastore is created if it doesn't already exist.
The property of interest here is javax.jdo.option.ConnectionURL. The default value of this property is jdbc:derby:;databaseName=metastore_db;create=true. This value specifies that you will be using embedded Derby as your Hive metastore, and that the location of the metastore is metastore_db.
7.1.1 Config hive-site.xml
- Copy the hive-default.xml template as hive-site.xml:
$ cp conf/hive-default.xml.template conf/hive-site.xml
- Set the default properties for the tmpdir and user name at the top of $HOME/hive/conf/hive-site.xml:
<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/${user.name}/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>
If you fail to set up these values you will get an exception when running Hive: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
- We can also configure the directory where Hive stores table data. By default, the location of the warehouse is /user/${user.name}/warehouse, as specified in hive-site.xml:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/${user.name}/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
- Notice this is a location in HDFS, so it must exist before you create any database. Create the warehouse directory for Hive in HDFS:
$ hadoop fs -mkdir /user/hadoop/warehouse
$ hdfs dfs -ls -R /
drwxr-xr-x - hadoop supergroup 0 2018-11-22 18:50 /user
drwxr-xr-x - hadoop supergroup 0 2018-11-22 18:50 /user/hadoop
drwxr-xr-x - hadoop supergroup 0 2018-11-22 18:50 /user/hadoop/warehouse
7.2 Tuning Hive
Hive automatically determines the number of reducers based on the following formula:
$$reducers = \frac{\text{bytes of input to mappers}}{\text{hive.exec.reducers.bytes.per.reducer}}$$
You can limit the number of reducers produced by this heuristic using hive.exec.reducers.max.
If you know exactly the number of reducers you want, you can set mapred.reduce.tasks, and this will override all heuristics. (By default this is set to -1, indicating Hive should use its heuristics.)
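For example, with 64 GB of mapper input and hive.exec.reducers.bytes.per.reducer set to 256 MB, the formula yields 256 reducers, which hive.exec.reducers.max can cap. A minimal sketch of overriding these values for a single session from the shell (the numbers are illustrative, not recommendations):
$ hive --hiveconf hive.exec.reducers.bytes.per.reducer=268435456 --hiveconf hive.exec.reducers.max=32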
8 Hive+Tez instead of MR
Tez is a DAG (directed acyclic graph) architecture. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Hive can be run on Tez instead of MapReduce.
A typical MapReduce job has the following steps:
- Read data from file --> one disk access
- Run mappers
- Write map output --> second disk access
- Run shuffle and sort --> read map output, third disk access
- write shuffle and sort --> write sorted data for reducers --> fourth disk access
- Run reducers which reads sorted data --> fifth disk output
- Write reducers output --> sixth disk access
Tez works very similarly to Spark (Tez was created by Hortonworks well before Spark):
- Execute the plan but no need to read data from disk.
- Once ready to do some calculations (similar to actions in spark), get the data from disk and perform all steps and produce output.
Only one read and one write.
Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disks). On top of that there is vectorization (process batch of rows instead of one row at a time). All this adds to efficiencies in query time.
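Once Tez is installed (see the next section), the execution engine can be selected per session in the same way the MR engine was selected earlier, for example:
$ hive --hiveconf hive.execution.engine=tez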
8.1 Setup Tez
TO DO
This section is incomplete and will be concluded as soon as possible.
9 Load data
9.1 Generate TPCH files
For example, to generate scale 100 data, run:
export DSS_PATH=/data/TPCH/data100
mkdir $DSS_PATH
./dbgen -s 100 -f
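The other scales used in this document can be generated the same way; a minimal sketch, assuming dbgen is run from the TPC-H dbgen build directory and that DSS_PATH controls the output location as above:
for SF in 10 20 50 100
do
    export DSS_PATH=/data/TPCH/data$SF
    mkdir -p $DSS_PATH
    ./dbgen -s $SF -f
done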
9.2 Start hive
Start hive and check everything is working properly.
$ hive
hive >
9.3 Create database and tables
Execute Hive and run the SQL commands to create the TPCH database and tables. Hive supports several storage formats:
- TEXTFILE: if your data is delimited by some parameters.
- ORCFILE (Optimized Row Columnar): if you want to store your data in an optimized way which lessens your storage and increases your performance.
- RCFILE (Record Columnar File): if you want to perform analytics on your data and you want to store your data efficiently.
- SEQUENCEFILE: if your data is in small files whose size is less than the block size.
DROP DATABASE IF EXISTS tpch_100 CASCADE;
CREATE DATABASE tpch_100;
USE tpch_100;
CREATE TABLE IF NOT EXISTS nation ( n_nationkey integer , n_name char(25) , n_regionkey integer , n_comment varchar(152) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS region ( r_regionkey integer , r_name char(25) , r_comment varchar(152) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS part ( p_partkey integer , p_name varchar(55) , p_mfgr char(25) , p_brand char(10) , p_type varchar(25) , p_size integer , p_container char(10) , p_retailprice decimal(15,2) , p_comment varchar(23) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS supplier ( s_suppkey integer , s_name char(25) , s_address varchar(40) , s_nationkey integer , s_phone char(15) , s_acctbal decimal(15,2) , s_comment varchar(101) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS partsupp ( ps_partkey integer , ps_suppkey integer , ps_availqty integer , ps_supplycost decimal(15,2) , ps_comment varchar(199) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS customer ( c_custkey integer , c_name varchar(25) , c_address varchar(40) , c_nationkey integer , c_phone char(15) , c_acctbal decimal(15,2) , c_mktsegment char(10) , c_comment varchar(117) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS orders ( o_orderkey integer , o_custkey integer , o_orderstatus char(1) , o_totalprice decimal(15,2) , o_orderdate date , o_orderpriority char(15) , o_clerk char(15) , o_shippriority integer , o_comment varchar(79) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
CREATE TABLE IF NOT EXISTS lineitem ( l_orderkey integer , l_partkey integer , l_suppkey integer , l_linenumber integer , l_quantity decimal(15,2) , l_extendedprice decimal(15,2) , l_discount decimal(15,2) , l_tax decimal(15,2) , l_returnflag char(1) , l_linestatus char(1) , l_shipdate date , l_commitdate date , l_receiptdate date , l_shipinstruct char(25) , l_shipmode char(10) , l_comment varchar(44) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Alternatively, an optimized ORC/partitioned schema can be created from an existing text-format database (${SOURCE}), using ${FILE} as the storage format:
DROP DATABASE IF EXISTS tpch_100 CASCADE;
CREATE DATABASE tpch_100;
USE tpch_100;
create table region stored as ${FILE} TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB') as select distinct * from ${SOURCE}.region;
create table nation stored as ${FILE} TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB') as select distinct * from ${SOURCE}.nation;
create table customer stored as ${FILE} TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB') as select * from ${SOURCE}.customer cluster by C_MKTSEGMENT;
create table supplier stored as ${FILE} TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB') as select * from ${SOURCE}.supplier cluster by s_nationkey, s_suppkey;
create table part stored as ${FILE} TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB') as select * from ${SOURCE}.part cluster by p_brand;
create table partsupp stored as ${FILE} TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB') as select * from ${SOURCE}.partsupp cluster by PS_SUPPKEY;
create table orders (O_ORDERKEY BIGINT, O_CUSTKEY BIGINT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) partitioned by (O_ORDERDATE STRING) stored as ${FILE} ;
ALTER TABLE orders SET TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB');
INSERT OVERWRITE TABLE orders partition(O_ORDERDATE) select O_ORDERKEY , O_CUSTKEY , O_ORDERSTATUS , O_TOTALPRICE , O_ORDERPRIORITY , O_CLERK , O_SHIPPRIORITY , O_COMMENT, O_ORDERDATE from ${SOURCE}.orders ;
create table lineitem (L_ORDERKEY BIGINT, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) partitioned by (L_SHIPDATE STRING) stored as ${FILE} ;
ALTER TABLE lineitem SET TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB');
INSERT OVERWRITE TABLE lineitem Partition(L_SHIPDATE) select L_ORDERKEY , L_PARTKEY , L_SUPPKEY , L_LINENUMBER , L_QUANTITY , L_EXTENDEDPRICE , L_DISCOUNT , L_TAX , L_RETURNFLAG , L_LINESTATUS , L_COMMITDATE , L_RECEIPTDATE , L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT , L_SHIPDATE from ${SOURCE}.lineitem ;
analyze table nation compute statistics for columns;
analyze table region compute statistics for columns;
analyze table supplier compute statistics for columns;
analyze table part compute statistics for columns;
analyze table partsupp compute statistics for columns;
analyze table customer compute statistics for columns;
analyze table orders compute statistics for columns;
analyze table lineitem compute statistics for columns;
9.4 Load data
From hive shell run the following commands to load TPCH 100 generated files:
LOAD DATA LOCAL INPATH '/data/TPCH/data100/nation.tbl' OVERWRITE INTO TABLE nation;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/region.tbl' OVERWRITE INTO TABLE region;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/supplier.tbl' OVERWRITE INTO TABLE supplier;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/customer.tbl' OVERWRITE INTO TABLE customer;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/part.tbl' OVERWRITE INTO TABLE part;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/partsupp.tbl' OVERWRITE INTO TABLE partsupp;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/orders.tbl' OVERWRITE INTO TABLE orders;
LOAD DATA LOCAL INPATH '/data/TPCH/data100/lineitem.tbl' OVERWRITE INTO TABLE lineitem;
Table | Rows (SF=100) | Size GB (SF=100) | Time (SF=100) | Rows (SF=1000) | Size GB (SF=1000) | Time (SF=1000) |
---|---|---|---|---|---|---|
region | 1 | 0 | 00:00 | . | 0 | . |
nation | 1 | 0 | 00:00 | . | 0 | . |
supplier | 1M | 0.14 | 00:02 | . | 0 | . |
customer | 15M | 2.3 | 00:03 | . | 0 | . |
part | 20M | 2.6 | 00:01 | . | 0 | . |
partsupp | 80M | 11 | 00:13 | . | 0 | . |
orders | 150M | 15 | 00:25 | . | 0 | . |
lineitem | 600M | 64 | 01:08 | . | 0 | . |
Total table load | | | 00:01:52 | | | ? |
9.5 Using ORC storage
To use ORC storage, first create a delimited text table to load the generated file, then copy the data into the ORC table.
CREATE TABLE IF NOT EXISTS lineitem_ext ( l_orderkey integer , l_partkey integer , l_suppkey integer , l_linenumber integer , l_quantity decimal(15,2) , l_extendedprice decimal(15,2) , l_discount decimal(15,2) , l_tax decimal(15,2) , l_returnflag char(1) , l_linestatus char(1) , l_shipdate date , l_commitdate date , l_receiptdate date , l_shipinstruct char(25) , l_shipmode char(10) , l_comment varchar(44) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA LOCAL INPATH '/data/TPCH/data100/lineitem.tbl' OVERWRITE INTO TABLE lineitem_ext;
create table if not exists lineitem (L_ORDERKEY BIGINT, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY") ;
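The copy step itself is not shown above; a minimal sketch, assuming the delimited data was loaded into lineitem_ext, that the column order matches (as in the definitions above), and that both tables live in the tpch_100 database used earlier:
$ hive --database tpch_100 -e "INSERT OVERWRITE TABLE lineitem SELECT * FROM lineitem_ext"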
9.6 Creating statistics
Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them.
analyze table region compute statistics;
analyze table nation compute statistics;
analyze table supplier compute statistics;
analyze table customer compute statistics;
analyze table part compute statistics;
analyze table partsupp compute statistics;
analyze table orders compute statistics;
analyze table lineitem compute statistics;
10 Running queries
Now you can run Hive SQL queries using either MR or Tez.
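For example, a TPCH query stored in a local file can be launched against one of the databases created earlier (q01.sql is a hypothetical file name holding the query text):
$ hive --database tpch_100 -f q01.sql
$ hive --hiveconf hive.execution.engine=tez --database tpch_100 -f q01.sql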
TO DO
This section is incomplete and will be concluded as soon as possible.
11 Hadoop cluster
In etc/hadoop/workers, set the worker (slave) nodes.
Copy the content of etc/hadoop from the master to the rest of the nodes.
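A minimal sketch for distributing the configuration, assuming passwordless SSH for the hadoop user and the host names listed in etc/hadoop/workers:
for host in $(cat /home/hadoop/etc/hadoop/workers)
do
    scp -r /home/hadoop/etc/hadoop/* ${host}:/home/hadoop/etc/hadoop/
done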