Rust Uptime Monitor: 130K Checks/sec per Core with Postgres and ClickHouse

Uptimepage: One Binary, Two Databases, 130K Checks/sec per Core

Uptimepage is an open-source uptime monitor and status page built in Rust. It ships as a single 23 MB binary plus Postgres and ClickHouse. The design choices are worth studying: one process, two databases, and a custom HTTP client that sustains 130K checks per second per core.

Why One Binary?

A typical uptime monitor is a handful of services with a message bus. Uptimepage runs everything in one process: scheduler, probe workers, HTTP client, time-series writer, incident detector, alerting, web UI, and JSON API. No queue to operate, no version skew, no "which container is wedged" at 3am. The trade-off is careful thread isolation to prevent one bad target from stalling the whole system.

Stack: Rust 1.95 (edition 2024), Tokio, Axum, Askama for compile-time HTML templates, and HTMX for partial swaps. The API is the single source of truth — every UI mutation hits the same /api/v1/* endpoint a script would.

Two Databases, On Purpose

Monitors are low-cardinality relational data: targets, regions, channels, incidents, plans. That's Postgres. Check results are append-only, high-cardinality, and queried by time range. That's ClickHouse. Trying to force one into the other is where uptime monitors usually fail. Postgres for config, ClickHouse for the firehose. Both run their migrations at process startup.

The HTTP Client I Didn't Want to Write

The first version used a popular high-level HTTP client. It worked, but a monitor is an unusual HTTP workload: it connects once per target per interval, never reuses the connection, and cares about the timing of each phase. So the author dropped down to hyper and hyper-util with rustls, building a connector that times DNS resolution, TCP connect, and TLS handshake as separate numbers. The request runs over hyper::client::conn and aborts the connection task the moment the body is read. Each result carries dns_ms, connect_ms, tls_ms, and ttfb_ms as distinct columns.
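
The idea of timing each phase separately can be illustrated with the standard library alone. This is a minimal sketch, not the project's connector: it covers only DNS resolution and TCP connect, and the names PhaseTimings and timed_connect are hypothetical. The real client is described as using hyper, hyper-util, and rustls, which also yields tls_ms and ttfb_ms.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream, ToSocketAddrs};
use std::time::{Duration, Instant};

/// Per-phase timings in milliseconds. Field names follow the article's columns;
/// the struct itself is a hypothetical illustration.
#[derive(Debug)]
pub struct PhaseTimings {
    pub dns_ms: f64,
    pub connect_ms: f64,
}

/// Resolve `host:port`, then open a fresh TCP connection to the first address,
/// timing the two phases separately. TLS and time-to-first-byte are omitted
/// so the sketch stays std-only.
pub fn timed_connect(target: &str, timeout: Duration) -> io::Result<(TcpStream, PhaseTimings)> {
    // Phase 1: name resolution.
    let dns_start = Instant::now();
    let addr: SocketAddr = target
        .to_socket_addrs()?
        .next()
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no address resolved"))?;
    let dns_ms = dns_start.elapsed().as_secs_f64() * 1000.0;

    // Phase 2: TCP connect, with its own clock so DNS time is not mixed in.
    let connect_start = Instant::now();
    let stream = TcpStream::connect_timeout(&addr, timeout)?;
    let connect_ms = connect_start.elapsed().as_secs_f64() * 1000.0;

    Ok((stream, PhaseTimings { dns_ms, connect_ms }))
}
```

Calling timed_connect("example.com:443", Duration::from_secs(5)) would return the open stream plus both timings; a TLS handshake and the request itself would each get their own Instant in the same way.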

The rewrite paid for itself. On a single core, the client sustains around 130K checks a second at saturation — roughly 7.7 microseconds per check. That's a 44-56% throughput gain over the old path. A chunk came from removing a url::parse call hiding in the redirect policy that cost 7.5% on its own. Two cores get about 153K. Scaling goes sub-linear past four cores due to shared HTTP/2 connection state and the pool mutex.
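
The per-check figure follows directly from the throughput, and the same arithmetic shows how far a multi-core number is from linear scaling. The helper names below are hypothetical; the inputs are the figures quoted above, taken at face value.

```rust
/// Average time budget per check, in microseconds, at a given throughput.
pub fn micros_per_check(checks_per_sec: f64) -> f64 {
    1_000_000.0 / checks_per_sec
}

/// Fraction of ideal linear scaling achieved when `cores` cores together
/// deliver `total` checks/sec and one core alone delivers `single_core`.
pub fn scaling_efficiency(single_core: f64, cores: u32, total: f64) -> f64 {
    total / (single_core * f64::from(cores))
}
```

With 130K checks a second, micros_per_check gives about 7.7 microseconds. With 153K on two cores, scaling_efficiency comes out at roughly 0.59, that is, about 59% of what perfect linear scaling would deliver.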

One Heap, Not a Timer Per Target

The naive scheduler spawns a timer task per monitor. That falls apart at fleet size. Instead, there's a single driver task owning one BinaryHeap, used as a min-heap keyed by the next due Instant for the whole fleet. Memory stays flat. Each target gets a deterministic jitter offset hashed from its UUID, so a thousand monitors on a 60-second interval don't all fire on the same tick. Generation and sequence counters prevent double-firing. The registry refresh runs on its own task with exponential backoff, so a Postgres hiccup never stalls dispatch.
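
A single-heap scheduler along these lines can be sketched with the standard library. Everything here is an assumption made for illustration: the Scheduler type, the tuple stored in the heap, FNV-1a as the jitter hash, and a u128 standing in for the UUID. Because std's BinaryHeap is a max-heap, the sketch wraps entries in Reverse to pop the earliest due time first.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};
use std::time::{Duration, Instant};

/// Deterministic jitter in [0, interval): FNV-1a over the target's 16 ID bytes.
/// The hash choice is an assumption; any stable hash of the UUID would do.
pub fn jitter(target_id: u128, interval: Duration) -> Duration {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in target_id.to_be_bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    let nanos = interval.as_nanos().max(1) as u64;
    Duration::from_nanos(hash % nanos)
}

pub struct Scheduler {
    // Reverse turns the max-heap into a min-heap on (due time, generation, id).
    heap: BinaryHeap<Reverse<(Instant, u64, u128)>>,
    // Current generation and interval per target.
    targets: HashMap<u128, (u64, Duration)>,
}

impl Scheduler {
    pub fn new() -> Self {
        Scheduler { heap: BinaryHeap::new(), targets: HashMap::new() }
    }

    /// Add or update a target. Bumping the generation makes any entry that is
    /// already queued for this target stale, so it cannot fire twice.
    pub fn upsert(&mut self, id: u128, interval: Duration, now: Instant) {
        assert!(!interval.is_zero(), "interval must be non-zero");
        let generation = self.targets.get(&id).map_or(0, |(g, _)| g + 1);
        self.targets.insert(id, (generation, interval));
        self.heap.push(Reverse((now + jitter(id, interval), generation, id)));
    }

    /// Pop every target due at `now` and requeue each one interval later.
    pub fn pop_due(&mut self, now: Instant) -> Vec<u128> {
        let mut due = Vec::new();
        while let Some(&Reverse((at, generation, id))) = self.heap.peek() {
            if at > now {
                break; // earliest entry is in the future, nothing else is due
            }
            self.heap.pop();
            match self.targets.get(&id) {
                Some(&(current, interval)) if current == generation => {
                    due.push(id);
                    self.heap.push(Reverse((now + interval, generation, id)));
                }
                _ => {} // stale generation or removed target: drop silently
            }
        }
        due
    }
}
```

A driver loop would sleep until the time at the top of the heap, call pop_due, and hand the returned IDs to the worker pool. Requeueing at now + interval is a simplification that lets schedules drift slightly instead of bursting to catch up after a stall.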

Failure Isolation Inside One Process

Three patterns prevent one bad target from taking down the rest:

Per-host circuit breakers trip when a host keeps failing, failing fast with circuit_open instead of tying up a worker on a timeout.
Per-tenant host throttle bulkhead caps how many checks can be in flight against one host at once; over the cap, a check is recorded as throttled and degraded.
Singleflight on RDAP collapses domain-expiry checks for the same domain across many tenants into one upstream probe.
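
The first of these patterns can be sketched as a small per-host state table. This is an illustration under stated assumptions, not the project's breaker: the HostBreakers type, the failure threshold, and the cooldown after which one trial check is let through are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Allowed,
    CircuitOpen,
}

pub struct HostBreakers {
    threshold: u32,
    cooldown: Duration,
    // host -> (consecutive failures, time the breaker last tripped)
    state: HashMap<String, (u32, Option<Instant>)>,
}

impl HostBreakers {
    pub fn new(threshold: u32, cooldown: Duration) -> Self {
        HostBreakers { threshold, cooldown, state: HashMap::new() }
    }

    /// Decide whether a check against `host` may run now. While the breaker
    /// is open the check fails fast instead of waiting on a timeout.
    pub fn admit(&self, host: &str, now: Instant) -> Admission {
        match self.state.get(host) {
            Some(&(_, Some(opened))) if now.duration_since(opened) < self.cooldown => {
                Admission::CircuitOpen
            }
            _ => Admission::Allowed,
        }
    }

    /// Record the outcome of a check that was allowed to run.
    pub fn record(&mut self, host: &str, ok: bool, now: Instant) {
        let entry = self.state.entry(host.to_string()).or_insert((0, None));
        if ok {
            *entry = (0, None); // any success closes the breaker
        } else {
            entry.0 += 1;
            if entry.0 >= self.threshold {
                entry.1 = Some(now); // trip, or re-trip after a failed trial
            }
        }
    }
}
```

A worker would call admit before connecting and record afterwards. A host that keeps failing costs one map lookup per check rather than a full timeout, and other hosts are unaffected.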

The worker pool is a task-per-dispatch gated by a semaphore, with an SSRF guard filtering resolved IPs before any connect.
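
An SSRF guard of this kind filters the addresses a hostname resolved to before any socket is opened. The sketch below is an assumption about what such a filter might reject, not the project's rule set; the function names are hypothetical and the blocked ranges are deliberately not exhaustive.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// True if an IPv4 address is outside the common internal ranges.
fn v4_is_public(ip: Ipv4Addr) -> bool {
    !(ip.is_loopback()
        || ip.is_private()      // 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()   // 169.254/16, includes cloud metadata endpoints
        || ip.is_unspecified()
        || ip.is_broadcast()
        || ip.is_multicast())
}

/// True if an IPv6 address is outside the common internal ranges.
fn v6_is_public(ip: Ipv6Addr) -> bool {
    // An IPv4-mapped address (::ffff:a.b.c.d) is judged by its IPv4 part.
    if let Some(mapped) = ip.to_ipv4_mapped() {
        return v4_is_public(mapped);
    }
    let first = ip.segments()[0];
    let unique_local = (first & 0xfe00) == 0xfc00; // fc00::/7
    let link_local = (first & 0xffc0) == 0xfe80; // fe80::/10
    !(ip.is_loopback() || ip.is_unspecified() || ip.is_multicast() || unique_local || link_local)
}

/// Keep only the resolved addresses a probe is allowed to connect to.
pub fn allowed_targets(resolved: &[IpAddr]) -> Vec<IpAddr> {
    resolved
        .iter()
        .copied()
        .filter(|ip| match ip {
            IpAddr::V4(v4) => v4_is_public(*v4),
            IpAddr::V6(v6) => v6_is_public(*v6),
        })
        .collect()
}
```

Filtering the resolved IPs, rather than the hostname, matters because a public-looking name can resolve to an internal address. The connect should then use one of the returned addresses directly so a second lookup cannot return something different.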

Modeling Time Series in ClickHouse

The check_results table is a MergeTree ordered by (org_id, target_id, region, timestamp). It is partitioned by day, which keeps the partition count from growing into the millions. Storage savings come from a few choices:

Timestamps use CODEC(DoubleDelta, ZSTD(1)) — check intervals are near-constant, so DoubleDelta crushes gaps.
Numeric phase columns use CODEC(T64, ZSTD(1)).
region and agent_id are LowCardinality(String), status is an Enum8.
Retention is per row with a ttl_days column and TTL timestamp + toIntervalDay(ttl_days).

Two AggregatingMergeTree materialized views provide per-minute and hourly rollups holding quantilesState for p50/p95/p99 and per-status counts. Reads route by range: anything inside 30 days hits the minute rollup, older ranges hit the hour rollup, and raw reads are capped at 90 days.
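
The routing rule can be written as a small pure function. This sketch rests on one interpretation of the rule, namely that the decision depends on how many days back the requested range starts; the names and the exact boundary handling are assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Source {
    MinuteRollup,
    HourRollup,
}

// Thresholds taken from the text; the constant names are hypothetical.
const MINUTE_ROLLUP_WINDOW_DAYS: u32 = 30;
const RAW_READ_CAP_DAYS: u32 = 90;

/// Pick a rollup from how far back the query starts, in days before now.
pub fn route(start_age_days: u32) -> Source {
    if start_age_days <= MINUTE_ROLLUP_WINDOW_DAYS {
        Source::MinuteRollup
    } else {
        Source::HourRollup
    }
}

/// Clamp a raw-table read so it never reaches back further than the cap.
pub fn clamp_raw_start(start_age_days: u32) -> u32 {
    start_age_days.min(RAW_READ_CAP_DAYS)
}
```

Keeping the rule in one function means the UI and the API cannot disagree about which table a given chart reads from.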

Regional Probes Without a Second Brain

An agent is the same binary in agent mode: a stateless probe with no database, no web UI, and no alerting. Adding a region adds execution capacity, never a second control plane. Agents pull config with ETag handling, serve last-known config if the control plane blips, and pause if their token is revoked. Results ship in batches that reuse one UUID across retries so a lost ack can't double-count. Region and agent identity are derived server-side from the bearer token, never sent in the payload.
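
The batch-ID trick is ordinary idempotent ingestion: the receiver remembers which batch IDs it has stored and acknowledges replays without storing them again. The sketch below assumes an in-memory set and hypothetical Batch and Ingest types; a u128 stands in for the UUID, and a real receiver would need to bound or expire the set.

```rust
use std::collections::HashSet;

/// A batch of check results. `results` holds placeholder values; the real
/// payload shape is not shown in the article.
pub struct Batch {
    pub id: u128,
    pub results: Vec<u32>,
}

pub struct Ingest {
    seen: HashSet<u128>,
    pub stored: usize,
}

impl Ingest {
    pub fn new() -> Self {
        Ingest { seen: HashSet::new(), stored: 0 }
    }

    /// Accept a batch. Returns true if it was newly stored and false if it
    /// was a replay. The caller acknowledges in both cases, so a retry after
    /// a lost ack is harmless.
    pub fn accept(&mut self, batch: &Batch) -> bool {
        if !self.seen.insert(batch.id) {
            return false; // already stored under this ID
        }
        self.stored += batch.results.len();
        true
    }
}
```

On the agent side the only requirement is that the ID is generated once per batch, before the first send, and reused unchanged on every retry.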

Incidents as a Follower

The incident detector is a background task that follows the check_results stream and writes into Postgres. It never touches the hot write path, never gates check execution, and never produces alerts directly. Detection is boring: two or more consecutive unhealthy results with no open incident open one; two or more consecutive healthy results close it. A unique index on open incidents resolves races. Opening or resolving fires a non-blocking signal to the escalation engine, which does repeat-until-acknowledged paging across about fourteen transports, with sharded per-incident locks. Channel secrets are sealed at rest with AES-GCM.
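
The detection rule is a two-state machine driven by streak counters. This is a minimal sketch for a single target; the Detector and Transition names are hypothetical, and the Postgres write and unique index are left out.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Transition {
    Opened,
    Resolved,
}

pub struct Detector {
    threshold: u32,
    unhealthy_streak: u32,
    healthy_streak: u32,
    open: bool,
}

impl Detector {
    /// `threshold` is the number of consecutive results needed to flip state
    /// (two in the article's description).
    pub fn new(threshold: u32) -> Self {
        Detector { threshold, unhealthy_streak: 0, healthy_streak: 0, open: false }
    }

    /// Feed one check result, in stream order. Returns a transition only when
    /// an incident opens or resolves.
    pub fn observe(&mut self, healthy: bool) -> Option<Transition> {
        if healthy {
            self.healthy_streak += 1;
            self.unhealthy_streak = 0;
        } else {
            self.unhealthy_streak += 1;
            self.healthy_streak = 0;
        }
        if !self.open && self.unhealthy_streak >= self.threshold {
            self.open = true;
            return Some(Transition::Opened);
        }
        if self.open && self.healthy_streak >= self.threshold {
            self.open = false;
            return Some(Transition::Resolved);
        }
        None
    }
}
```

A single flapping result never opens or closes anything, which is the point of requiring a streak. Each returned transition is where the signal to the escalation engine would be sent.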

Automation as a First-Class Surface

Because the API is the single source of truth, the rest came almost for free: a self-describing OpenAPI spec with Swagger UI, an official Terraform provider, and an MCP server for LLM clients over OAuth.

Where It Is

Uptimepage is live, free to start with no card, and AGPL-3.0 open source. The core is not paywalled: checks, status pages, subscribers, the API, and every alert channel are in the free tier. The source is on GitHub.

Key Takeaways

Single binary simplifies ops but requires careful thread isolation.
Postgres + ClickHouse avoids impedance mismatch between config and time-series data.
Custom HTTP client using hyper directly yields a 44-56% throughput gain over the previous high-level client path.
BinaryHeap scheduler keeps memory flat regardless of fleet size.
ClickHouse schema design with DoubleDelta and T64 codecs saves significant storage.

If you're building a high-throughput monitoring system, study Uptimepage's architecture. The source is on GitHub under AGPL-3.0.
