This document describes how to set up and run the TPCH 100 benchmark in Apache Hive: creating the database and tables, loading the generated data files, and measuring load and query times, with comparisons against Informix IDS on the same hardware.

1 Create TPCH 100 test in Hive

1.1 Start hive

Start hive and check that everything is working properly.

$ hive
which: no hbase in (/home/hadoop/db-derby/bin:/home/hadoop/hive/bin:/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hadoop/.local/bin:/home/hadoop/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/home/hadoop/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>

1.2 Create database and tables

Execute hive and run the following SQL commands to create the TPCH database and tables:

DROP DATABASE IF EXISTS tpch CASCADE;
CREATE DATABASE tpch;
USE tpch;

CREATE TABLE IF NOT EXISTS nation (
    n_nationkey integer ,
    n_name char(25) ,
    n_regionkey integer ,
    n_comment varchar(152) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS region (
    r_regionkey integer ,
    r_name char(25) ,
    r_comment varchar(152) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS part (
    p_partkey integer  ,
    p_name varchar(55)  ,
    p_mfgr char(25)  ,
    p_brand char(10)  ,
    p_type varchar(25)  ,
    p_size integer  ,
    p_container char(10) ,
    p_retailprice decimal(15,2) ,
    p_comment varchar(23) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS supplier (
    s_suppkey integer  ,
    s_name char(25)  ,
    s_address varchar(40) ,
    s_nationkey integer  ,
    s_phone char(15) ,
    s_acctbal decimal(15,2) ,
    s_comment varchar(101) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS partsupp (
    ps_partkey integer ,
    ps_suppkey integer ,
    ps_availqty integer ,
    ps_supplycost decimal(15,2) ,
    ps_comment varchar(199) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS customer (
    c_custkey integer  ,
    c_name varchar(25)  ,
    c_address varchar(40) ,
    c_nationkey integer  ,
    c_phone char(15)  ,
    c_acctbal decimal(15,2)  ,
    c_mktsegment char(10)  ,
    c_comment varchar(117) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS orders (
    o_orderkey integer ,
    o_custkey integer ,
    o_orderstatus char(1) ,
    o_totalprice decimal(15,2) ,
    o_orderdate date  ,
    o_orderpriority char(15)  ,
    o_clerk char(15)  ,
    o_shippriority integer  ,
    o_comment varchar(79)  
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

CREATE TABLE IF NOT EXISTS lineitem (
    l_orderkey integer  ,
    l_partkey integer  ,
    l_suppkey integer  ,
    l_linenumber integer  ,
    l_quantity decimal(15,2)  ,
    l_extendedprice decimal(15,2)  ,
    l_discount decimal(15,2)  ,
    l_tax decimal(15,2)  ,
    l_returnflag char(1)  ,
    l_linestatus char(1)  ,
    l_shipdate date  ,
    l_commitdate date  ,
    l_receiptdate date  ,
    l_shipinstruct char(25)  ,
    l_shipmode char(10)  ,
    l_comment varchar(44) 
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
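
After the DDL completes, it is worth confirming that all eight tables exist before loading any data. A quick check (assuming the tpch database created above):

```sql
-- List the eight TPCH tables just created
USE tpch;
SHOW TABLES;

-- Inspect one table's schema to confirm the column types were applied
DESCRIBE lineitem;
```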

1.3 Load data

In this example we're using TPCH-100 data files generated for the Informix TPCH tests. Hive should be started from $HADOOP_HOME (/home/hadoop); the relative paths below assume that directory, so adjust them to point to wherever the files are actually located.
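
Note that the .tbl files produced by the TPCH dbgen tool end every row with a trailing '|'. With FIELDS TERMINATED BY '|' Hive sees this as an extra empty field, which is silently dropped because the tables declare a fixed column list; if you prefer clean input, the trailing delimiter can be stripped first. A sketch, assuming the files live under tpch/unlfiles/ as in the LOAD commands below:

```shell
# Strip the trailing '|' that dbgen appends to every row (in-place edit).
# The tpch/unlfiles/ path matches the LOAD commands below; adjust as needed.
for f in tpch/unlfiles/*.tbl; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    sed -i 's/|$//' "$f"
done
```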

Start hive and run the following LOAD commands:

LOAD DATA LOCAL INPATH 'tpch/unlfiles/nation.tbl' OVERWRITE INTO TABLE nation;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/region.tbl' OVERWRITE INTO TABLE region;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/part.tbl' OVERWRITE INTO TABLE part;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/supplier.tbl' OVERWRITE INTO TABLE supplier;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/partsupp.tbl' OVERWRITE INTO TABLE partsupp;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/customer.tbl' OVERWRITE INTO TABLE customer;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/orders.tbl' OVERWRITE INTO TABLE orders;
LOAD DATA LOCAL INPATH 'tpch/unlfiles/lineitem.tbl' OVERWRITE INTO TABLE lineitem;
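
Once the loads finish, row counts can be checked against the cardinalities the TPC-H specification defines for scale factor 100 (for example, orders should hold 150,000,000 rows and lineitem roughly 600,000,000). A quick sanity check:

```sql
-- Row-count sanity check after loading. At scale factor 100 expect:
-- nation 25, region 5, supplier 1,000,000, customer 15,000,000,
-- part 20,000,000, partsupp 80,000,000, orders 150,000,000,
-- lineitem ~600,000,000 rows
SELECT COUNT(*) FROM nation;
SELECT COUNT(*) FROM orders;
SELECT COUNT(*) FROM lineitem;
```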

2 Performance figures

2.1 Data loading

The table below shows the time taken to load each TPCH 100 file into Hive.

File          TPCH 100 load time
nation.tbl       0.837 seconds
region.tbl       1.585 seconds
part.tbl        20.081 seconds
supplier.tbl     5.447 seconds
partsupp.tbl   143     seconds
customer.tbl    45     seconds
orders.tbl     183     seconds
lineitem.tbl   867     seconds
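
Summing the per-file timings above gives the total load time for the data set; a small sketch using the numbers from the table:

```shell
# Sum the TPCH-100 per-file load times from the table above (seconds)
awk 'BEGIN {
    total = 0.837 + 1.585 + 20.081 + 5.447 + 143 + 45 + 183 + 867
    printf "total: %.3f s (%.1f min)\n", total, total / 60
}'
# prints: total: 1265.950 s (21.1 min)
```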

2.2 SQL Execution

Query 1: This query took 03:56 min to execute in Hive on Hadoop. For comparison, the same query took 01:16 min in Informix IDS on the same architecture.

hive> select sum(o_totalprice) from orders;
Query ID = hadoop_20180212200128_95ef3447-7712-4dcc-83be-e8a9b668fba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2018-02-12 20:01:32,772 Stage-1 map = 0%,  reduce = 0%
2018-02-12 20:01:36,911 Stage-1 map = 100%,  reduce = 0%
2018-02-12 20:02:37,629 Stage-1 map = 100%,  reduce = 0%
2018-02-12 20:03:37,814 Stage-1 map = 100%,  reduce = 0%
2018-02-12 20:04:38,466 Stage-1 map = 100%,  reduce = 0%
2018-02-12 20:05:24,432 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local425811313_0001
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 629116219674 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
22668168442418.61
Time taken: 236.214 seconds, Fetched: 1 row(s)

Query 2: This query took 02:52 min to execute in Hive on Hadoop. For comparison, the same query took 52 seconds in Informix IDS using memory buffers.

hive> select sum(o_totalprice) from orders where o_orderkey = -1;
Query ID = hadoop_20180213103811_9d33ac88-3f82-4f32-9c1d-8764d4920fee
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2018-02-13 10:38:13,258 Stage-1 map = 0%,  reduce = 0%
2018-02-13 10:38:15,446 Stage-1 map = 100%,  reduce = 0%
2018-02-13 10:39:15,926 Stage-1 map = 100%,  reduce = 0%
2018-02-13 10:40:16,652 Stage-1 map = 100%,  reduce = 0%
2018-02-13 10:41:03,273 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1419924452_0002
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 629116224910 HDFS Write: 5644 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
NULL
Time taken: 172.021 seconds, Fetched: 1 row(s)

2.3 TPCH Tests

Query 1

select l_returnflag,
       l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1-l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1-l_discount) * (1+l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
 from lineitem
 where l_shipdate <= DATE_SUB(TO_DATE('1998-12-01'), 90)
 group by l_returnflag, l_linestatus
 order by l_returnflag, l_linestatus;

Query 2: Not supported by this version of Hive

SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
FROM part, supplier, partsupp, nation, region
WHERE p_partkey = ps_partkey
     AND s_suppkey = ps_suppkey
     AND p_size = 15
     AND p_type LIKE '%BRASS'
     AND s_nationkey = n_nationkey
     AND n_regionkey = r_regionkey
     AND r_name = 'EUROPE'
     AND ps_supplycost = (
		SELECT MIN(ps_supplycost)
		FROM partsupp, supplier, nation, region
		WHERE
			p_partkey = ps_partkey
			AND s_suppkey = ps_suppkey
			AND s_nationkey = n_nationkey
			AND n_regionkey = r_regionkey
			AND r_name = 'EUROPE'
     )
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
LIMIT 100;
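
Hive rejects this query because the subquery is correlated: the inner SELECT references p_partkey from the outer block, and this Hive version does not support that form in the WHERE clause. A possible workaround, not part of the original test, is to precompute the minimum EUROPE supply cost per part in a derived table and join against it:

```sql
-- Rewrite of TPCH Query 2 without the correlated subquery:
-- the minimum EUROPE supply cost is computed per part up front,
-- then joined back on (partkey, supplycost).
SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
FROM part
JOIN partsupp ON p_partkey = ps_partkey
JOIN supplier ON s_suppkey = ps_suppkey
JOIN nation   ON s_nationkey = n_nationkey
JOIN region   ON n_regionkey = r_regionkey
JOIN (
    SELECT ps_partkey AS m_partkey, MIN(ps_supplycost) AS m_supplycost
    FROM partsupp
    JOIN supplier ON s_suppkey = ps_suppkey
    JOIN nation   ON s_nationkey = n_nationkey
    JOIN region   ON n_regionkey = r_regionkey
    WHERE r_name = 'EUROPE'
    GROUP BY ps_partkey
) mincost ON p_partkey = m_partkey AND ps_supplycost = m_supplycost
WHERE p_size = 15
  AND p_type LIKE '%BRASS'
  AND r_name = 'EUROPE'
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
LIMIT 100;
```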

Query 3: Executed for TPCH 100 in seconds:

SELECT l_orderkey, SUM(l_extendedprice * (1 - l_discount)) AS revenue, o_orderdate, o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
     AND c_custkey = o_custkey
     AND l_orderkey = o_orderkey
     AND o_orderdate < TO_DATE('1995-03-15')
     AND l_shipdate  > TO_DATE('1995-03-15')
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate;