TPC-H — a guided tour · spark-rust-wasm

TPC-H is the classic decision-support benchmark, modeling a traditional enterprise data warehouse: a focused, 8-table schema for a wholesale parts supplier, stressed by 22 analytical queries full of joins, aggregations, and correlated subqueries. Its scale factor runs from a laptop-sized sf=1 (≈1 GB) all the way to sf=100000 (100 TB) — but the schema stays compact and relational, which is exactly the point: it exercises a query optimizer the way a classic star/snowflake warehouse does.

This demo runs all 22 queries against a genuine (browser-sized) TPC-H dataset using the real spark-rust Spark SQL engine compiled to WebAssembly — entirely in your browser tab. This page explains what you're looking at.

The schema at a glance

TPC-H is a snowflake schema: one big fact table (lineitem) at the center, fanning out through orders to customers, and through partsupp to parts and suppliers, with geography (nation → region) on the edge. Every arrow below is a foreign-key join — read it as "many … point to one …".

fact (the rows you aggregate) dimension / lookup geography

Schema · foreign-key relationships

flowchart TB
    LINEITEM["📦 lineitem
the fact · ≈6M @ sf=1"]
    ORDERS["orders
≈1.5M @ sf=1"]
    CUSTOMER["customer
≈150K @ sf=1"]
    PARTSUPP["partsupp
≈800K @ sf=1"]
    PART["part
≈200K @ sf=1"]
    SUPPLIER["supplier
≈10K @ sf=1"]
    NATION["🌍 nation
25 · fixed"]
    REGION["🌍 region
5 · fixed"]

    LINEITEM -->|l_orderkey| ORDERS
    LINEITEM -->|l_partkey
l_suppkey| PARTSUPP
    ORDERS -->|o_custkey| CUSTOMER
    PARTSUPP -->|ps_partkey| PART
    PARTSUPP -->|ps_suppkey| SUPPLIER
    CUSTOMER -->|c_nationkey| NATION
    SUPPLIER -->|s_nationkey| NATION
    NATION -->|n_regionkey| REGION

    classDef fact fill:#1f6feb,stroke:#4a8bf5,color:#fff,font-weight:600;
    classDef geo  fill:#1b3a2a,stroke:#2f6f49,color:#d7f0df;
    classDef dim  fill:#222b38,stroke:#3a4658,color:#e6edf3;
    class LINEITEM fact;
    class REGION,NATION geo;
    class CUSTOMER,ORDERS,PART,PARTSUPP,SUPPLIER dim;

Row counts above are at the reference scale sf=1 (≈1 GB). Every table scales linearly, so the shape never changes — only the size: at sf=100 that lineitem is ~600M rows, at sf=1000 ~6 billion, and the browser demo's generated tier uses a tiny sf=0.01 slice. Only region (5) and nation (25) are fixed-cardinality at every scale.

The eight tables

Table	Grain (one row per…)	Key columns
`lineitem`	line on an order (the fact)	`l_orderkey`, `l_partkey`, `l_suppkey`
`orders`	customer order	`o_orderkey` → `o_custkey`
`customer`	customer	`c_custkey` → `c_nationkey`
`part`	part in the catalog	`p_partkey`
`supplier`	supplier	`s_suppkey` → `s_nationkey`
`partsupp`	part stocked by a supplier	`ps_partkey` + `ps_suppkey`
`nation`	nation (25, fixed)	`n_nationkey` → `n_regionkey`
`region`	region (5, fixed)	`r_regionkey`

The 22 queries, by theme

TPC-H's queries were hand-designed as optimizer choke points — each one targets a specific hard problem. The chart shows the SQL shapes the 22 queries lean on (a query can carry several tags): nearly every query aggregates, and two-thirds wrap a subquery.

Rather than march 1→22, here they are grouped by what they actually stress:

SQL shapes · 22 queries

aggregate16

subquery14

IN-subquery3

CASE3

anti-join2

semi-join1

outer-join1

CTE / view1

Pricing & revenue roll-ups

Q1 · Q6 · Q14

Single- or few-table scans with heavy arithmetic aggregation over lineitem. Q1 groups every line by return-flag/status; Q6 is a pure filtered SUM (the simplest query — a forecasting "what-if" on a discount window); Q14 measures the promotional-revenue share. These reward fast scans and fixed-point decimal math.

Top-N shipping & ordering

Q3 · Q4 · Q5 · Q10

Multi-way joins funneling toward a ranked, limited result. Q3 finds the top unshipped orders by revenue for a market segment; Q5 totals revenue by nation within a region; Q10 ranks customers by lost revenue on returns. Q4 counts late-committed orders with an EXISTS.

Supply chain & parts

Q2 · Q11 · Q16 · Q19 · Q20

Queries that walk the part–partsupp–supplier side of the schema. Q2 finds the minimum-cost supplier for a part in a region; Q11 flags high-value stock held by a nation's suppliers; Q20 hunts suppliers with excess stock of "forest" parts. These lean on correlated subqueries and the two-column partsupp key.

Trade, profit & market share

Q7 · Q8 · Q9

The widest joins in the set — fact joined to both the customer and the supplier geography at once. Q7 measures trade revenue between two nations; Q8 a nation's market share for a part type; Q9 product-type profit by nation and year. Stress tests for join ordering.

Correlated subqueries & behavior

Q15 · Q17 · Q18 · Q21 · Q22

The optimizer killers. Q17/Q18 compare a line against an average computed in a correlated subquery; Q21 finds suppliers who kept multi-supplier orders waiting (nested EXISTS/NOT EXISTS); Q22 is a global "no recent orders" anti-join. Q15 uses a view (the top supplier by quarterly revenue).

Reporting & distribution

Q12 · Q13

Q12 correlates ship-mode with order priority for late deliveries; Q13 profiles how many orders customers place (an outer join + group-of-groups distribution). Classic "business intelligence" report shapes.

Why every query returns rows here. TPC-H's compact relational schema has no fixed-size mega-dimensions and no multi-channel self-joins, so its data stays dense even at a small scale factor — every table scales linearly and the 22 queries use broad predicates, so a browser-sized sf=0.01 dataset already returns rows for all 22. (TPC-DS, the web-scale sibling, is the opposite: deliberately sparse and selective — see its guide for why some of its queries are empty at low scale.) Pick any query in the gallery and hit Run.

Try it

Head back to the demo, pick a dataset tier (or the zero-download Generated tier), and click a query — or Run all 22 to watch the whole suite execute in the wasm sandbox. The SQL viewer shows each query's exact text; the result panel shows real rows computed locally, with nothing leaving your tab.