1 Create TPCH 100 test in Hive
1.1 Start Hive
Start Hive and check that everything is working properly.
$ hive
which: no hbase in (/home/hadoop/db-derby/bin:/home/hadoop/hive/bin:/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hadoop/.local/bin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/home/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
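Once the hive> prompt appears, a quick sanity check is to list the databases and print the active execution engine; the two statements below are standard Hive CLI commands, added here as a suggestion rather than part of the original session:
SHOW DATABASES;              -- should list at least the built-in 'default' database
set hive.execution.engine;   -- prints the current engine (mr here, per the deprecation warning above)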
1.2 Create database and tables
Execute hive and run the following SQL commands to create the TPCH database and tables:
DROP DATABASE IF EXISTS tpch CASCADE;
CREATE DATABASE tpch;
USE tpch;

CREATE TABLE IF NOT EXISTS nation (
  n_nationkey integer,
  n_name char(25),
  n_regionkey integer,
  n_comment varchar(152)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS region (
  r_regionkey integer,
  r_name char(25),
  r_comment varchar(152)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS part (
  p_partkey integer,
  p_name varchar(55),
  p_mfgr char(25),
  p_brand char(10),
  p_type varchar(25),
  p_size integer,
  p_container char(10),
  p_retailprice decimal(15,2),
  p_comment varchar(23)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS supplier (
  s_suppkey integer,
  s_name char(25),
  s_address varchar(40),
  s_nationkey integer,
  s_phone char(15),
  s_acctbal decimal(15,2),
  s_comment varchar(101)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS partsupp (
  ps_partkey integer,
  ps_suppkey integer,
  ps_availqty integer,
  ps_supplycost decimal(15,2),
  ps_comment varchar(199)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS customer (
  c_custkey integer,
  c_name varchar(25),
  c_address varchar(40),
  c_nationkey integer,
  c_phone char(15),
  c_acctbal decimal(15,2),
  c_mktsegment char(10),
  c_comment varchar(117)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS orders (
  o_orderkey integer,
  o_custkey integer,
  o_orderstatus char(1),
  o_totalprice decimal(15,2),
  o_orderdate date,
  o_orderpriority char(15),
  o_clerk char(15),
  o_shippriority integer,
  o_comment varchar(79)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS lineitem (
  l_orderkey integer,
  l_partkey integer,
  l_suppkey integer,
  l_linenumber integer,
  l_quantity decimal(15,2),
  l_extendedprice decimal(15,2),
  l_discount decimal(15,2),
  l_tax decimal(15,2),
  l_returnflag char(1),
  l_linestatus char(1),
  l_shipdate date,
  l_commitdate date,
  l_receiptdate date,
  l_shipinstruct char(25),
  l_shipmode char(10),
  l_comment varchar(44)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
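Before loading any data, it is worth confirming that the schema landed as expected; a minimal check (not part of the original run):
USE tpch;
SHOW TABLES;          -- should list the eight TPCH tables created above
DESCRIBE lineitem;    -- column names and types as declared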
1.3 Load data
In this example we are using TPCH 100 data files generated for the Informix TPCH tests. Hive should be started from $HADOOP_HOME (/home/hadoop); adjust the paths below to wherever your data files are located.
Start Hive and run the following LOAD commands:
LOAD DATA LOCAL INPATH 'tpch/unlfiles/nation.tbl' OVERWRITE INTO TABLE nation;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/region.tbl' OVERWRITE INTO TABLE region;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/part.tbl' OVERWRITE INTO TABLE part;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/supplier.tbl' OVERWRITE INTO TABLE supplier;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/partsupp.tbl' OVERWRITE INTO TABLE partsupp;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/customer.tbl' OVERWRITE INTO TABLE customer;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/orders.tbl' OVERWRITE INTO TABLE orders;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/lineitem.tbl' OVERWRITE INTO TABLE lineitem;
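After the loads complete, row counts give a quick correctness check. The expected counts below come from the TPCH specification (nation and region are fixed-size, lineitem scales with the scale factor), so treat the lineitem figure as approximate:
SELECT COUNT(*) FROM nation;    -- expect 25 rows (fixed by the TPCH spec)
SELECT COUNT(*) FROM region;    -- expect 5 rows (fixed by the TPCH spec)
SELECT COUNT(*) FROM lineitem;  -- roughly 600 million rows at scale factor 100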
2 Performance figures
2.1 Data loading
Load times observed for each TPCH 100 data file:
File | TPCH 100 load time
---|---
nation.tbl | 0.837 seconds
region.tbl | 1.585 seconds
part.tbl | 20.081 seconds
supplier.tbl | 5.447 seconds
partsupp.tbl | 143 seconds
customer.tbl | 45 seconds
orders.tbl | 183 seconds
lineitem.tbl | 867 seconds
2.2 SQL Execution
Query 1: This query took 03:56 min. to execute in Hadoop. For comparison, the same query took 01:16 min. in Informix IDS on the same architecture.
hive> select sum(o_totalprice) from orders;
Query ID = hadoop_20180212200128_95ef3447-7712-4dcc-83be-e8a9b668fba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2018-02-12 20:01:32,772 Stage-1 map = 0%, reduce = 0%
2018-02-12 20:01:36,911 Stage-1 map = 100%, reduce = 0%
2018-02-12 20:02:37,629 Stage-1 map = 100%, reduce = 0%
2018-02-12 20:03:37,814 Stage-1 map = 100%, reduce = 0%
2018-02-12 20:04:38,466 Stage-1 map = 100%, reduce = 0%
2018-02-12 20:05:24,432 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local425811313_0001
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 629116219674 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
22668168442418.61
Time taken: 236.214 seconds, Fetched: 1 row(s)
Query 2: This query took 02:52 min. to execute in Hadoop. For comparison, the same query took 52 seconds in Informix IDS using memory buffers.
hive> select sum(o_totalprice) from orders where o_orderkey = -1;
Query ID = hadoop_20180213103811_9d33ac88-3f82-4f32-9c1d-8764d4920fee
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2018-02-13 10:38:13,258 Stage-1 map = 0%, reduce = 0%
2018-02-13 10:38:15,446 Stage-1 map = 100%, reduce = 0%
2018-02-13 10:39:15,926 Stage-1 map = 100%, reduce = 0%
2018-02-13 10:40:16,652 Stage-1 map = 100%, reduce = 0%
2018-02-13 10:41:03,273 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1419924452_0002
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 629116224910 HDFS Write: 5644 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
NULL
Time taken: 172.021 seconds, Fetched: 1 row(s)
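For queries whose plans allow more than one reducer, the job log above names the knobs that control reducer parallelism. A hedged example of setting them before re-running; the values are illustrative, not tuned for this machine:
set hive.exec.reducers.bytes.per.reducer=268435456;  -- ~256 MB of input per reducer
set hive.exec.reducers.max=16;                       -- upper bound on the reducer count
set mapreduce.job.reduces=8;                         -- or pin an exact reducer count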
2.3 TPCH Tests
Query 1
select
  l_returnflag,
  l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= DATE_SUB(TO_DATE('1998-12-01'), 90)
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
Query 2: Not supported by Hive; the correlated scalar subquery in the WHERE clause is rejected.
SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
FROM part, supplier, partsupp, nation, region
WHERE p_partkey = ps_partkey
  AND s_suppkey = ps_suppkey
  AND p_size = 15
  AND p_type LIKE '%BRASS'
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'EUROPE'
  AND ps_supplycost = (
    SELECT MIN(ps_supplycost)
    FROM partsupp, supplier, nation, region
    WHERE p_partkey = ps_partkey
      AND s_suppkey = ps_suppkey
      AND s_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'EUROPE'
  )
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
LIMIT 100;
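A common workaround for the unsupported correlated subquery is to precompute the per-part minimum EUROPE supply cost in a derived table and join against it. This rewrite is a sketch and was not run as part of these tests:
SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
FROM part
JOIN partsupp ON p_partkey = ps_partkey
JOIN supplier ON s_suppkey = ps_suppkey
JOIN nation   ON s_nationkey = n_nationkey
JOIN region   ON n_regionkey = r_regionkey
JOIN (
  -- minimum EUROPE supply cost per part, precomputed instead of correlated
  SELECT ps_partkey AS min_partkey, MIN(ps_supplycost) AS min_supplycost
  FROM partsupp
  JOIN supplier ON s_suppkey = ps_suppkey
  JOIN nation   ON s_nationkey = n_nationkey
  JOIN region   ON n_regionkey = r_regionkey
  WHERE r_name = 'EUROPE'
  GROUP BY ps_partkey
) mincost ON p_partkey = mincost.min_partkey AND ps_supplycost = mincost.min_supplycost
WHERE p_size = 15
  AND p_type LIKE '%BRASS'
  AND r_name = 'EUROPE'
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
LIMIT 100;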
Query 3: Executed for TPCH 100 in seconds:
SELECT
  l_orderkey,
  SUM(l_extendedprice * (1 - l_discount)) AS revenue,
  o_orderdate,
  o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < TO_DATE('1995-03-15')
  AND l_shipdate > TO_DATE('1995-03-15')
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate;