To generate data on top of Hadoop HDFS, we can either run dbgen manually and copy the resulting data files to HDFS, or use a MapReduce wrapper around dbgen.

1 Copy data into HDFS

The simplest way to place data on HDFS is to use the HDFS copy commands: generate the files with dbgen on local storage, then transfer them to HDFS.
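
For example, assuming dbgen has been built and its dists.dss distribution file is in the current directory:

# Generate all eight TPC-H tables at scale factor 1 (roughly 1 GB in total);
# dbgen writes customer.tbl, lineitem.tbl, ... into the current directory.
./dbgen -s 1
ls *.tbl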

# Create one HDFS directory per TPC-H table, then copy the generated
# .tbl files into place. The -rm removes any previous copy; on the very
# first run it prints a harmless error because no file exists yet.
TABLES="customer lineitem nation orders part partsupp region supplier"

$HADOOP_HOME/bin/hadoop fs -mkdir /tpch/

for TABLE in $TABLES; do
    $HADOOP_HOME/bin/hadoop fs -mkdir /tpch/$TABLE
    echo "mkdir $TABLE"
done

for TABLE in $TABLES; do
    $HADOOP_HOME/bin/hadoop fs -rm /tpch/$TABLE/$TABLE.tbl
    $HADOOP_HOME/bin/hadoop fs -copyFromLocal $TABLE.tbl /tpch/$TABLE/
    echo "$TABLE"
done

2 MapReduce dbgen wrapper

A more sophisticated way to generate TPC-H data is to redirect the dbgen output to HDFS from within a MapReduce job. This way, data is transferred directly from the dbgen output to the HDFS store.

To generate the data sets, run (say, for scale = 200 and parallelism = 100):

$ hadoop jar tpch-gen.jar -d /user/hive/external/200/ -p 100 -s 200
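
Here -s is the TPC-H scale factor, -p the number of parallel generator (map) tasks, and -d the HDFS directory that receives the output.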

This reuses the parallelism already built into the dbgen program: without modifying dbgen, the wrapper runs it on multiple machines, each instance generating a different portion of the data.
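
The feature this relies on is dbgen's chunking flags: -C splits a table into N partitions and -S selects which partition to build, so each machine can generate a disjoint chunk. For illustration, run locally:

# First two of four lineitem partitions; each invocation is independent
# and could run on a different machine.
./dbgen -T L -s 200 -C 4 -S 1   # writes lineitem.tbl.1
./dbgen -T L -s 200 -C 4 -S 2   # writes lineitem.tbl.2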

The command generates multiple output files (one per map task), and each table gets its own subdirectory.
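
As a reference point, such a wrapper can be sketched with Hadoop Streaming. Everything below is illustrative: the gen_chunk.sh mapper, the output paths, and the streaming jar location are assumptions, and tpch-gen.jar is not necessarily implemented this way. Each map task reads one chunk number, runs dbgen for that chunk, and pushes the result to HDFS:

#!/bin/sh
# gen_chunk.sh -- hypothetical streaming mapper: reads one chunk number
# from stdin and generates that chunk of the lineitem table.
read LINE
CHUNK=$(echo "$LINE" | awk '{print $NF}')   # tolerate a leading key field
chmod +x ./dbgen                            # files shipped with -file may lose the execute bit
./dbgen -T L -s 200 -C 100 -S "$CHUNK"      # writes lineitem.tbl.<chunk> locally
hadoop fs -put lineitem.tbl."$CHUNK" /user/hive/external/200/lineitem/

# Driver: one input line per chunk; NLineInputFormat hands each map task
# exactly one line, i.e. one dbgen chunk to generate.
seq 1 100 > chunks.txt
hadoop fs -put chunks.txt /tmp/chunks.txt
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -numReduceTasks 0 \
    -input /tmp/chunks.txt \
    -output /tmp/tpch-gen-logs \
    -mapper gen_chunk.sh \
    -file gen_chunk.sh -file dbgen -file dists.dss

A complete wrapper would repeat this for all eight tables and clean up the temporary input and log directories.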

2.1 MapReduce dbgen wrapper

TO DO

This section is incomplete and will be completed as soon as possible.