the-stats-duck v0.6.0: Run Statistics Directly in DuckDB
the-stats-duck is an open-source DuckDB extension that brings statistical analysis into SQL. Version 0.6.0, dubbed "i-m-not-dead", adds regression, bootstrap, new distributions, and a plot grammar. It's MIT-licensed and runs anywhere DuckDB does, including in the browser via WebAssembly.
Regression with R-Style Formulas
You can now fit ordinary least squares models using the lm() and lm_summary() table functions. The formula syntax mirrors R: body_mass_g ~ flipper_length_mm + bill_length_mm. lm() returns coefficient estimates, standard errors, t-statistics, and p-values. lm_summary() gives model-level metrics: R², adjusted R², F-statistic, and residual degrees of freedom.
SELECT *
FROM lm_summary(
'penguins',
formula := 'body_mass_g ~ flipper_length_mm + bill_length_mm'
);
Under the hood, it uses a Cholesky decomposition of X'X, with complete-case filtering. The results were cross-verified against R on the cars and mtcars datasets to four decimal places.
Bootstrap Confidence Intervals
For non-parametric inference, use the bootstrap() function. It resamples with replacement and returns a list of the chosen statistic (mean, median, stddev, etc.). Combine with list_aggregate for percentile intervals. You can pass a seed for reproducibility.
WITH b AS (
SELECT bootstrap(body_mass_g, 'mean', 2000, 42) AS samples
FROM penguins
)
SELECT
list_aggregate(samples, 'quantile_cont', 0.025) AS lo,
list_aggregate(samples, 'quantile_cont', 0.975) AS hi
FROM b;
Plot Grammar: VISUALIZE ... DRAW
v0.6.0 introduces a small plot grammar (ggsql) that compiles to Vega-Lite. New marks include violin plots, 2-D facets (FACET BY row, col), and per-layer statistics like LOESS smoothing (STAT smooth).
VISUALIZE
bill_depth_mm AS x,
bill_length_mm AS y,
species AS color
FROM penguins
DRAW point
DRAW line STAT smooth
SCALE x ZERO false
SCALE x LABEL 'Bill Depth (mm)'
SCALE y ZERO false
SCALE y LABEL 'Bill Length (mm)'
TITLE 'Penguins Bill Depth VS Bill Length (Smooth)';
Three STAT modifiers ship: smooth (LOESS), summary (average per group), and identity (no-op). The regression mark draws a straight least-squares line per group.
Ten New Probability Distributions
The extension now includes d/p/q/r functions for negative binomial, hypergeometric, Weibull, log-normal, and Poisson distributions. Each function was cross-checked against R to six decimal places. Example: Poisson PMF as a bar chart.
WITH pois AS (
SELECT k, dpois(k, 3) AS pmf
FROM range(0, 11) AS t(k)
)
VISUALIZE
k AS x,
pmf AS y
FROM pois
DRAW bar;
52x Faster XPT Reading
Reading SAS/SPSS/Stata files (read_stat) got a major speedup. The old implementation re-parsed the file from zero for every 2,048-row chunk, resulting in O(N²) time. v0.6.0 parses the file once into a buffered column store. Benchmarks:
- 200k-row, 7 MB XPT: 67 s → 1.3 s (52x)
- CDISC pilot qs.xpt (122k rows): 39 s → 1.1 s (36x)
- 1.6M rows: 70 min → 15 s
Getting Started
the-stats-duck is a DuckDB extension. Install it via DuckDB's INSTALL and LOAD commands, or use it inside Bedevere (a browser-based SQL notebook). Full release notes are on GitHub.
If you're doing statistical analysis in SQL, this extension removes the need to round-trip data to R or Python. Try it with the Palmer Penguins dataset—all examples run live in the browser at kolistat.com.

