TPC-H is the classic decision-support benchmark, modeling a
traditional enterprise data warehouse: a focused, 8-table schema
for a wholesale parts supplier, stressed by 22 analytical queries full of joins,
aggregations, and correlated subqueries. Its scale factor runs from a laptop-sized
sf=1 (≈1 GB) all the way to sf=100000 (100 TB) —
but the schema stays compact and relational, which is exactly the point: it
exercises a query optimizer the way a classic star/snowflake warehouse
does.
This demo runs all 22 queries against a genuine (browser-sized) TPC-H dataset using the real spark-rust Spark SQL engine compiled to WebAssembly — entirely in your browser tab. This page explains what you're looking at.
The schema at a glance
TPC-H is a snowflake schema: one big fact table
(lineitem) at the center, fanning out through orders
to customers, and through partsupp to parts and suppliers, with
geography (nation → region) on the edge. Every arrow
below is a foreign-key join — read it as "many … point to one …".
Schema · foreign-key relationships
flowchart TB
LINEITEM["📦 lineitem
the fact · ≈6M @ sf=1"]
ORDERS["orders
≈1.5M @ sf=1"]
CUSTOMER["customer
≈150K @ sf=1"]
PARTSUPP["partsupp
≈800K @ sf=1"]
PART["part
≈200K @ sf=1"]
SUPPLIER["supplier
≈10K @ sf=1"]
NATION["🌍 nation
25 · fixed"]
REGION["🌍 region
5 · fixed"]
LINEITEM -->|l_orderkey| ORDERS
LINEITEM -->|l_partkey
l_suppkey| PARTSUPP
ORDERS -->|o_custkey| CUSTOMER
PARTSUPP -->|ps_partkey| PART
PARTSUPP -->|ps_suppkey| SUPPLIER
CUSTOMER -->|c_nationkey| NATION
SUPPLIER -->|s_nationkey| NATION
NATION -->|n_regionkey| REGION
classDef fact fill:#1f6feb,stroke:#4a8bf5,color:#fff,font-weight:600;
classDef geo fill:#1b3a2a,stroke:#2f6f49,color:#d7f0df;
classDef dim fill:#222b38,stroke:#3a4658,color:#e6edf3;
class LINEITEM fact;
class REGION,NATION geo;
class CUSTOMER,ORDERS,PART,PARTSUPP,SUPPLIER dim;
Row counts above are at the reference scale sf=1 (≈1 GB). Every
table scales linearly, so the shape never changes — only the size: at
sf=100 that lineitem is ~600M rows, at sf=1000
~6 billion, and the browser demo's generated tier uses a tiny
sf=0.01 slice. Only region (5) and nation (25)
are fixed-cardinality at every scale.
The eight tables
| Table | Grain (one row per…) | Key columns |
|---|---|---|
lineitem | line on an order (the fact) | l_orderkey, l_partkey, l_suppkey |
orders | customer order | o_orderkey → o_custkey |
customer | customer | c_custkey → c_nationkey |
part | part in the catalog | p_partkey |
supplier | supplier | s_suppkey → s_nationkey |
partsupp | part stocked by a supplier | ps_partkey + ps_suppkey |
nation | nation (25, fixed) | n_nationkey → n_regionkey |
region | region (5, fixed) | r_regionkey |
The 22 queries, by theme
TPC-H's queries were hand-designed as optimizer choke points — each one targets a specific hard problem. The chart shows the SQL shapes the 22 queries lean on (a query can carry several tags): nearly every query aggregates, and two-thirds wrap a subquery.
Rather than march 1→22, here they are grouped by what they actually stress:
SQL shapes · 22 queries
Pricing & revenue roll-ups
Single- or few-table scans with heavy arithmetic aggregation over
lineitem. Q1 groups every line by return-flag/status;
Q6 is a pure filtered SUM (the simplest query — a
forecasting "what-if" on a discount window); Q14 measures the
promotional-revenue share. These reward fast scans and fixed-point decimal math.
Top-N shipping & ordering
Multi-way joins funneling toward a ranked, limited result.
Q3 finds the top unshipped orders by revenue for a market
segment; Q5 totals revenue by nation within a region;
Q10 ranks customers by lost revenue on returns.
Q4 counts late-committed orders with an EXISTS.
Supply chain & parts
Queries that walk the part–partsupp–supplier
side of the schema. Q2 finds the minimum-cost supplier for a part
in a region; Q11 flags high-value stock held by a nation's suppliers;
Q20 hunts suppliers with excess stock of "forest" parts. These lean on
correlated subqueries and the two-column partsupp key.
Trade, profit & market share
The widest joins in the set — fact joined to both the customer and the supplier geography at once. Q7 measures trade revenue between two nations; Q8 a nation's market share for a part type; Q9 product-type profit by nation and year. Stress tests for join ordering.
Correlated subqueries & behavior
The optimizer killers. Q17/Q18 compare a line
against an average computed in a correlated subquery; Q21 finds
suppliers who kept multi-supplier orders waiting (nested EXISTS/NOT
EXISTS); Q22 is a global "no recent orders" anti-join.
Q15 uses a view (the top supplier by quarterly revenue).
Reporting & distribution
Q12 correlates ship-mode with order priority for late deliveries; Q13 profiles how many orders customers place (an outer join + group-of-groups distribution). Classic "business intelligence" report shapes.
Why every query returns rows here. TPC-H's compact relational
schema has no fixed-size mega-dimensions and no multi-channel self-joins, so its
data stays dense even at a small scale factor — every table scales linearly and
the 22 queries use broad predicates, so a browser-sized sf=0.01
dataset already returns rows for all 22. (TPC-DS, the web-scale sibling, is the
opposite: deliberately sparse and selective — see its guide for why some of
its queries are empty at low scale.) Pick any query in the gallery and
hit Run.
Try it
Head back to the demo, pick a dataset tier (or the zero-download Generated tier), and click a query — or Run all 22 to watch the whole suite execute in the wasm sandbox. The SQL viewer shows each query's exact text; the result panel shows real rows computed locally, with nothing leaving your tab.