The TPCH database and dbgen data generation utility, courtesy of www.tpc.org, were developed to provide an approach to benchmarking and include:

  • The tpch Database structure
  • The tpch dbgen utility, a utility to populate the database with a specified amount of data (Scale Factor)
  • The tpch benchmark queries, a set of pre-defined data warehouse queries to run against the database

We will show in detail how to create the tpch database and populate it with data generated by the dbgen utility.

In essence, the schema consists of 8 tables, 8 explicit unique indexes supporting 8 primary keys and 9 explicit indexes supporting 9 foreign keys.

1 Download dbgen

The tpch dbgen utility generates, by default, a set of flat files suitable for loading into the tpch schema, with the total size determined by the “Scale Factor” argument. A scale factor of 1 produces a complete data set of approximately 1 GB, a scale factor of 10 produces approximately 10 GB, and so on.

Download the dbgen source code:

$ git clone https://github.com/electrum/tpch-dbgen.git
Cloning into 'tpch-dbgen'...
remote: Counting objects: 149, done.
remote: Total 149 (delta 0), reused 0 (delta 0), pack-reused 149
Receiving objects: 100% (149/149), 216.15 KiB | 202.00 KiB/s, done.
Resolving deltas: 100% (30/30), done.
Checking connectivity... done.

You need git and the gcc compiler installed on your machine.
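
The prerequisite above can be checked before cloning or compiling. The following is a minimal sketch; check_tools is a hypothetical helper name, not part of dbgen, and make is included only because the compile step uses it.

```shell
# Report which of the given tools are available on PATH.
# check_tools is an illustrative helper, not part of the dbgen distribution.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools git gcc make || echo "Install the missing tools before continuing."
```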

2 Compile dbgen

In the downloaded directory (tpch-dbgen), edit the file makefile.suite and set the following variables to the appropriate values:

CC=gcc
DATABASE=INFORMIX
MACHINE=LINUX
WORKLOAD=TPCH
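
If you prefer not to edit the file by hand, the same four assignments can be applied with sed. This is only a sketch: set_make_var is a hypothetical helper, and it assumes each variable already appears in makefile.suite on its own line in the form NAME = (optional spaces before the equals sign). Check the result before running make.

```shell
# Replace the assignment line for NAME in a makefile with NAME=VALUE.
# set_make_var is an illustrative helper, not part of dbgen.
# Usage: set_make_var FILE NAME VALUE
set_make_var() {
    file=$1
    name=$2
    value=$3
    tmp="$file.tmp.$$"
    # Match only "NAME", optional whitespace, then "=", so that a
    # variable such as CCFLAGS is not touched when NAME is CC.
    sed "s|^$name[[:space:]]*=.*|$name=$value|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Apply the values used in this guide when the file is present.
if [ -f makefile.suite ]; then
    set_make_var makefile.suite CC gcc
    set_make_var makefile.suite DATABASE INFORMIX
    set_make_var makefile.suite MACHINE LINUX
    set_make_var makefile.suite WORKLOAD TPCH
fi
```

A temporary file is used instead of in-place editing because the in-place option of sed behaves differently between GNU and BSD versions.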

Then run the make utility:

$ make -f makefile.suite
gcc -g -DDBNAME=\"dss\" -DMAC -DINFORMIX -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o build.o build.c
gcc -g -DDBNAME=\"dss\" -DMAC -DINFORMIX -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64    -c -o driver.o driver.c
...

3 Test dbgen

Now you are ready to generate the tpch files:

  • Change to the appropriate directory where you want to generate tpch files. For example, create a subdirectory under the tpch-dbgen directory.
    $ mkdir data
    $ cd data
  • Copy the dbgen executable file and dists.dss file there.
    $ cp ../dbgen .
    $ cp ../dists.dss .
  • Run dbgen with the desired scale factor (1, that is about 1 GB, in this example).
    $ ./dbgen -s 1
  • Generation may take a while. When completed, you can see the resulting files.
    $ ls -l
    total 2150000
    -rw-r--r--  1 deister  staff   24346144 13 may 12:05 customer.tbl
    -rw-r--r--  1 deister  staff  759863287 13 may 12:05 lineitem.tbl
    -rw-r--r--  1 deister  staff       2224 13 may 12:05 nation.tbl
    -rw-r--r--  1 deister  staff  171952161 13 may 12:05 orders.tbl
    -rw-r--r--  1 deister  staff   24135125 13 may 12:05 part.tbl
    -rw-r--r--  1 deister  staff  118984616 13 may 12:05 partsupp.tbl
    -rw-r--r--  1 deister  staff        389 13 may 12:05 region.tbl
    -rw-r--r--  1 deister  staff    1409184 13 may 12:05 supplier.tbl
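
Beyond file sizes, a quick sanity check is to count the rows in each generated file. The sketch below assumes dbgen wrote one row per line, which is how its flat files are laid out; count_tbl_rows is a hypothetical helper name.

```shell
# Print "<file> <rows>" for every .tbl file in a directory.
# count_tbl_rows is an illustrative helper, not part of dbgen.
# Assumes one row per line in each flat file.
count_tbl_rows() {
    dir=${1:-.}
    for f in "$dir"/*.tbl; do
        # Skip the unexpanded pattern when no .tbl file exists.
        [ -f "$f" ] || continue
        # Strip the padding that some wc implementations add.
        rows=$(wc -l < "$f" | tr -d '[:space:]')
        echo "$(basename "$f") $rows"
    done
}

count_tbl_rows .
```

The counts can then be compared with the per-table factors listed under "Table sizes", multiplied by the scale factor that was passed to dbgen.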

As a sample, generating TPCH scale 10 on an Intel NUC i7-8550U (1.9 GHz) with an NVMe disk takes 2 minutes, and loading it takes 5 minutes.

4 Scale factor

The database is sized according to the selected scale factor.

4.1 Table sizes

Table      Factor (rows at scale 1)   Row size
lineitem   6000000                    146
orders     1500000                    136
partsupp   800000                     221
part       200000                     168
customer   150000                     227
supplier   10000                      200
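
The figures above can be combined into a rough size estimate per table: factor times row size times scale. The sketch below assumes the "Row size" column is expressed in bytes, which the table does not state, and it ignores index and storage overhead; table_bytes is a hypothetical helper name.

```shell
# Estimate the data size of one table as factor * row size * scale.
# table_bytes is an illustrative helper. It ASSUMES row size is in bytes
# and does not account for indexes or database storage overhead.
# Usage: table_bytes FACTOR ROW_SIZE SCALE
table_bytes() {
    factor=$1
    row_size=$2
    scale=$3
    echo $(( factor * row_size * scale ))
}

# lineitem at scale factor 1: 6000000 rows of 146 bytes each.
table_bytes 6000000 146 1    # prints 876000000
```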

4.2 Number of rows according to scale

                 Table (rows)
TPCH Scale (GB)  lineitem  orders  partsupp  part  customer  supplier
10               60M       15M     8M        2M    1.5M      0.1M
20               120M      30M     16M       4M    3M        0.2M
50               300M      75M     40M       10M   7.5M      0.5M
100              600M      150M    80M       20M   15M       1M
200              1.2B      300M    160M      40M   30M       2M
1000             6B        1.5B    800M      200M  150M      10M
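
For a scale that is not listed, the row counts follow from multiplying each per-unit factor in section 4.1 by the scale. A minimal sketch, with rows_for_scale as a hypothetical helper name:

```shell
# Print the expected row count of each table for a given scale factor,
# using the per-unit factors from section 4.1.
# rows_for_scale is an illustrative helper, not part of dbgen.
rows_for_scale() {
    scale=$1
    while read -r table factor; do
        echo "$table $(( factor * scale ))"
    done <<EOF
lineitem 6000000
orders 1500000
partsupp 800000
part 200000
customer 150000
supplier 10000
EOF
}

# Example: a scale that does not appear in the table above.
rows_for_scale 30
```

These are nominal figures; the actual number of rows that dbgen produces for a table may differ slightly from them.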