The TPCH database and dbgen data generation utility, courtesy of www.tpc.org, were developed to provide an approach to benchmarking and include:
- The tpch database structure
- The tpch dbgen utility, which populates the database with a specified amount of data (Scale Factor)
- The tpch benchmark queries, a set of pre-defined data warehouse queries to run against the database
We will show the details of the creation of the tpch database and its population using the dbgen utility to generate data.

1 Download dbgen
The tpch dbgen utility generates, by default, a set of flat files suitable for loading into the tpch schema, with the size based on the “Scale Factor” argument. A scale factor of 1 produces a complete data set of approximately 1 GB, a scale factor of 10 produces a data set of approximately 10 GB, etc.
Download the dbgen source code:
$ git clone https://github.com/electrum/tpch-dbgen.git
Cloning into 'tpch-dbgen'...
remote: Counting objects: 149, done.
remote: Total 149 (delta 0), reused 0 (delta 0), pack-reused 149
Receiving objects: 100% (149/149), 216.15 KiB | 202.00 KiB/s, done.
Resolving deltas: 100% (30/30), done.
Checking connectivity... done.
You need git and the gcc compiler installed on your machine.
2 Compile dbgen
In the downloaded directory (tpch-dbgen), edit the file makefile.suite and set the following variables to the appropriate values:
CC=gcc
DATABASE=INFORMIX
MACHINE=LINUX
WORKLOAD=TPCH
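The same edit can be scripted. The sketch below is only an illustration: it assumes each of the four variables is assigned on its own line that starts with the variable name (for example `CC      =`), which you should check against your copy of makefile.suite. The function name `set_dbgen_vars` is made up for this example and is not part of dbgen.

```shell
#!/bin/sh
# set_dbgen_vars FILE (hypothetical helper, not part of dbgen)
# Rewrites the CC, DATABASE, MACHINE and WORKLOAD assignments in FILE.
# Assumption: each variable is assigned on a line that starts with its name.
set_dbgen_vars() {
    file=$1
    tmp="$file.tmp.$$"
    # Write to a temporary file first, then replace the original,
    # so the edit works the same way with GNU and BSD sed.
    sed \
        -e 's/^CC[[:space:]]*=.*/CC=gcc/' \
        -e 's/^DATABASE[[:space:]]*=.*/DATABASE=INFORMIX/' \
        -e 's/^MACHINE[[:space:]]*=.*/MACHINE=LINUX/' \
        -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD=TPCH/' \
        "$file" > "$tmp" && mv "$tmp" "$file"
}
```

Called as `set_dbgen_vars makefile.suite` from inside the tpch-dbgen directory, it rewrites the file in place, so keep a copy of the original if you want to compare the result.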
Then run the make utility:
$ make -f makefile.suite
gcc -g -DDBNAME=\"dss\" -DMAC -DINFORMIX -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o build.o build.c
gcc -g -DDBNAME=\"dss\" -DMAC -DINFORMIX -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o driver.o driver.c
...
3 Test dbgen
Now you are ready to generate the tpch files.
- Change to the directory where you want to generate the tpch files. For example, create a subdirectory under the tpch-dbgen directory.
$ mkdir data
$ cd data
- Copy the dbgen executable file and the dists.dss file there.
$ cp ../dbgen .
$ cp ../dists.dss .
- Run dbgen for the appropriate database scale factor (1 GB in the sample).
$ ./dbgen -s 1
- Generation may take a while. When completed, you can see the resulting files.
$ ls -l
total 2150000
-rw-r--r-- 1 deister staff 24346144 13 may 12:05 customer.tbl
-rw-r--r-- 1 deister staff 759863287 13 may 12:05 lineitem.tbl
-rw-r--r-- 1 deister staff 2224 13 may 12:05 nation.tbl
-rw-r--r-- 1 deister staff 171952161 13 may 12:05 orders.tbl
-rw-r--r-- 1 deister staff 24135125 13 may 12:05 part.tbl
-rw-r--r-- 1 deister staff 118984616 13 may 12:05 partsupp.tbl
-rw-r--r-- 1 deister staff 389 13 may 12:05 region.tbl
-rw-r--r-- 1 deister staff 1409184 13 may 12:05 supplier.tbl
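The steps above can be collected into one small script. This is a sketch under assumptions: the function name `generate_tpch` and its three arguments are invented for illustration, and it presumes that dbgen and dists.dss sit in the build directory passed as the first argument. Only the `-s` option shown above is used.

```shell
#!/bin/sh
# generate_tpch BUILD_DIR OUT_DIR SCALE (hypothetical helper)
# Copies dbgen and dists.dss from BUILD_DIR into OUT_DIR, runs dbgen with
# the given scale factor, and checks that the eight .tbl files exist.
generate_tpch() {
    build_dir=$1
    out_dir=$2
    scale=$3

    mkdir -p "$out_dir" || return 1
    cp "$build_dir/dbgen" "$build_dir/dists.dss" "$out_dir" || return 1

    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$out_dir" && ./dbgen -s "$scale" ) || return 1

    # Table names taken from the directory listing above.
    for t in customer lineitem nation orders part partsupp region supplier; do
        if [ ! -s "$out_dir/$t.tbl" ]; then
            echo "missing or empty: $out_dir/$t.tbl" >&2
            return 1
        fi
    done
    echo "generated scale factor $scale in $out_dir"
}
```

From inside the tpch-dbgen directory, `generate_tpch . data 1` would reproduce the sample run. The final loop only confirms that each file exists and is not empty; it does not validate the contents.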
As a sample, generating TPCH scale 10 on an Intel NUC with an i7-8550U at 1.9 GHz and an NVMe disk takes 2 minutes, and loading it takes 5 minutes.
4 Scale factor
The database is sized according to the selected scale factor.
4.1 Table sizes
Table | Factor (rows at scale factor 1) | Row size |
---|---|---|
lineitem | 6000000 | 146 |
orders | 1500000 | 136 |
partsupp | 800000 | 221 |
part | 200000 | 168 |
customer | 150000 | 227 |
supplier | 10000 | 200 |
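The table can be turned into a quick estimate for any scale factor. The sketch below reads the Factor column as the number of rows at scale factor 1 and multiplies it by the scale. It also assumes the Row size column is expressed in bytes, which the table does not state. The function names `tpch_rows` and `tpch_bytes` are made up for this example, and the results are approximations rather than exact dbgen output.

```shell
#!/bin/sh
# tpch_rows TABLE SCALE (hypothetical helper)
# Approximate row count: Factor column multiplied by the scale factor.
tpch_rows() {
    case $1 in
        lineitem) factor=6000000 ;;
        orders)   factor=1500000 ;;
        partsupp) factor=800000 ;;
        part)     factor=200000 ;;
        customer) factor=150000 ;;
        supplier) factor=10000 ;;
        *) echo "unknown table: $1" >&2; return 1 ;;
    esac
    echo $(( factor * $2 ))
}

# tpch_bytes TABLE SCALE (hypothetical helper)
# Approximate table size: row count multiplied by the Row size column.
# Assumption: Row size is given in bytes.
tpch_bytes() {
    rows=$(tpch_rows "$1" "$2") || return 1
    case $1 in
        lineitem) size=146 ;;
        orders)   size=136 ;;
        partsupp) size=221 ;;
        part)     size=168 ;;
        customer) size=227 ;;
        supplier) size=200 ;;
    esac
    echo $(( rows * size ))
}
```

For example, `tpch_rows lineitem 10` prints 60000000, and `tpch_bytes orders 1` prints 204000000.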
4.2 Number of rows according to scale
TPCH Scale (GB) | lineitem | orders | partsupp | part | customer | supplier |
---|---|---|---|---|---|---|
10 | 60M | 15M | 8M | 2M | 1.5M | 0.1M |
20 | 120M | 30M | 16M | 4M | 3M | 0.2M |
50 | 300M | 75M | 40M | 10M | 7.5M | 0.5M |
100 | 600M | 150M | 80M | 20M | 15M | 1M |
200 | 1.2B | 300M | 160M | 40M | 30M | 2M |
1000 | 6B | 1.5B | 800M | 200M | 150M | 10M |