14× Faster Embeddings: How Manticore Rebuilt the ONNX Path

Manticore Search 27.1.5 ships a new ONNX Runtime backend for auto-embeddings that delivers ~14× the throughput of the previous SentenceTransformers/Candle path. On a 16-core/32-thread server with all-MiniLM-L12-v2, the old path managed 5–11 docs/sec across all thread and batch configurations. The new path reaches 72–233 docs/sec on the same hardware.

Why ONNX Runtime?

The old Candle path (Hugging Face's pure-Rust inference framework) left CPU on the floor: workloads sat at single-digit to low-double-digit docs/sec, and concurrent calls serialised on a single model session. ONNX Runtime (ORT), Microsoft's hand-tuned C++ inference engine, does graph fusion, constant folding, and kernel autotuning. Most popular embedding models (MiniLM, BGE, E5) already publish a pre-fused model.onnx in their Hugging Face repository.

The Session Sharing Hack

The key insight: ORT's C Run() API is thread-safe on Linux and macOS. The Rust wrapper hides this behind borrow-checker rules. Manticore wraps the session in an UnsafeCell and implements Sync/Send manually:

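// Non-Windows only: share one ORT session across threads without a lock.
#[cfg(not(target_os = "windows"))]
struct SessionWrapper {
    inner: std::cell::UnsafeCell<Session>,
}
// SAFETY: relies on ORT's Run() being thread-safe on these platforms.
#[cfg(not(target_os = "windows"))]
unsafe impl Sync for SessionWrapper {}
#[cfg(not(target_os = "windows"))]
unsafe impl Send for SessionWrapper {}

impl SessionWrapper {
    // Hands the closure a mutable reference to the shared session.
    fn with_session<R>(&self, f: impl FnOnce(&mut Session) -> R) -> R {
        f(unsafe { &mut *self.inner.get() })
    }
}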

This single shared session eliminates lock contention and pool overhead. On Windows, a Mutex serialises access due to known ORT threading issues.
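
For comparison, the Windows fallback described above can be sketched as a Mutex-backed wrapper that keeps the same with_session shape. This is a minimal sketch under stated assumptions, not Manticore's actual code: the Session type here is a hypothetical stand-in for the ORT session, and real code would gate the wrapper with #[cfg(target_os = "windows")].

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the ORT session (not the real type).
struct Session {
    calls: u64,
}

impl Session {
    /// Pretend inference: counts the call and returns the input length.
    fn run(&mut self, input: &str) -> usize {
        self.calls += 1;
        input.len()
    }
}

/// Mutex-backed variant: every caller takes the lock, so access is serialised.
/// In real code this would sit behind #[cfg(target_os = "windows")].
struct SessionWrapper {
    inner: Mutex<Session>,
}

impl SessionWrapper {
    fn new(session: Session) -> Self {
        Self { inner: Mutex::new(session) }
    }

    /// Same signature as the lock-free wrapper, so call sites do not change.
    fn with_session<R>(&self, f: impl FnOnce(&mut Session) -> R) -> R {
        // Recover the guard even if another thread panicked while holding it.
        let mut guard = self.inner.lock().unwrap_or_else(|e| e.into_inner());
        f(&mut *guard)
    }
}
```

Because both variants expose the same with_session signature, the rest of the embedding code can stay platform-agnostic; only the cost of each call differs.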

Batching Was a Trap

Textbook advice says batch inputs for throughput. Manticore tried batching 8, 16, 32 documents per inference call — and got lower throughput than processing one at a time. Two reasons:

  1. Padding tax: A batch of mixed-length texts pads every row to the longest. Real inputs vary wildly: one 60-token outlier forces seven 8-token rows to pay for padding. The model does work proportional to batch_size * max_len * hidden_dim, most of it on padding.

  2. Spinning: ORT's intra-op thread pool defaults to busy-waiting between dispatches. With one big batch, threads stay busy. With many concurrent small calls, every worker's pool pins cores at 100% CPU — stealing resources from tokenizers, HNSW builds, and the rest of searchd. Flipping with_intra_op_spinning(false) immediately raised throughput and dropped CPU usage.
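
To put a number on the padding tax, the arithmetic from the first point can be worked through in a few lines. The helper names below are hypothetical and exist only for illustration; the token counts are the ones from the example (one 60-token row plus seven 8-token rows).

```rust
/// Returns (real_tokens, padded_tokens) for one batch, given per-row token counts.
/// Padded size is rows * longest row, which is what the model actually processes.
fn padding_cost(token_lens: &[usize]) -> (usize, usize) {
    let max_len = token_lens.iter().copied().max().unwrap_or(0);
    let real: usize = token_lens.iter().sum();
    let padded = token_lens.len() * max_len;
    (real, padded)
}

/// Fraction of processed token positions that are padding (0.0 for an empty batch).
fn padding_fraction(token_lens: &[usize]) -> f64 {
    let (real, padded) = padding_cost(token_lens);
    if padded == 0 {
        0.0
    } else {
        (padded - real) as f64 / padded as f64
    }
}
```

For the batch [60, 8, 8, 8, 8, 8, 8, 8] this gives 116 real tokens against 480 padded positions, so about 76% of the work goes to padding. Processing the same eight documents one at a time touches only the 116 real tokens.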

The Final Design

  • One shared session, no pool.
  • One document per inference call, no batching inside the worker.
  • Many concurrent callers, scaled to CPU count.
  • No spinning between calls — yield the CPU.

The predict_pipelined function has two branches:

fn predict_pipelined(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>, _> {
    let bs = batch_size();
    if texts.len() <= bs {
        // Fast path: single tokenize + infer, no thread overhead
        return Self::tokenize_and_infer(&self.session, &self.tokenizer, texts, ...);
    }
    // Large input: split across workers, each running 1-doc-at-a-time
    // through the SHARED session
    let num_workers = (texts.len() / bs).min(available_cpus()).max(1);
    let docs_per_worker = texts.len().div_ceil(num_workers);
    std::thread::scope(|s| {
        for worker_texts in texts.chunks(docs_per_worker) {
            s.spawn(move || {
                for text in worker_texts {
                    Self::tokenize_and_infer(&session, &tokenizer,
                        std::slice::from_ref(text), ...)?;
                }
                Ok(())
            });
        }
    });
    // ...
}

Single-row INSERTs take the fast path with zero coordination overhead. Bulk REPLACE INTO takes the parallel branch.
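
The two branches can be tried out in isolation with a stub in place of the model. The following is a rough, self-contained sketch, not the real implementation: embed_one is a hypothetical stand-in for tokenize_and_infer on a single document, plan_workers and embed_parallel are illustrative names, and the batch size and CPU count are passed in as plain parameters (assumed to be at least 1). It also shows one way results might be collected in input order, which the original listing elides.

```rust
use std::thread;

/// Hypothetical stand-in for tokenize + infer on one document.
fn embed_one(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

/// Worker count and chunk size, following the arithmetic in predict_pipelined.
/// Assumes bs >= 1 and cpus >= 1.
fn plan_workers(n_docs: usize, bs: usize, cpus: usize) -> (usize, usize) {
    let num_workers = (n_docs / bs).min(cpus).max(1);
    (num_workers, n_docs.div_ceil(num_workers))
}

/// Embeds all texts, preserving input order.
fn embed_parallel(texts: &[&str], bs: usize, cpus: usize) -> Vec<Vec<f32>> {
    if texts.len() <= bs {
        // Fast path: no threads spawned for small inputs.
        return texts.iter().map(|t| embed_one(t)).collect();
    }
    let (_, docs_per_worker) = plan_workers(texts.len(), bs, cpus);
    thread::scope(|s| {
        // One scoped thread per chunk; each processes one document at a time.
        let handles: Vec<_> = texts
            .chunks(docs_per_worker)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|t| embed_one(t)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order keeps embeddings aligned with the input.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

With 1000 documents, a batch size of 64, and 32 available CPUs, plan_workers yields 15 workers of up to 67 documents each: the worker count is bounded by the input size as well as by the core count, so mid-sized bulk writes do not fan out to every core.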

Performance Numbers

All runs on a 16-core/32-thread server with all-MiniLM-L12-v2-onnx, 1000 documents per run.

Configuration          Old Candle (docs/sec)   New ONNX (docs/sec)
1 thread, batch=1      5                       72
1 thread, batch=64     11                      233
8 threads, batch=1     8                       130
32 threads, batch=1    8                       100

Single-insert latency: ~14 ms with one client, ~56 ms under 8-way concurrent load (vs 200+ ms for Candle).

What Changed (and What Didn't)

No user-facing API changes. Tables pointing at an ONNX-capable model pick up the new path automatically. To switch models without recreating a table: add a new column with the new model, rebuild embeddings, drop the old column.

The two biggest performance wins: with_intra_op_spinning(false) and giving up on batching documents inside the worker. The reverted commit 980b24b marks the moment the team stopped fighting the profiler.

Why This Matters for Developers

Auto-embeddings run the model on every INSERT, so embedding speed is ingest speed. In these tests the old path topped out at 5–11 docs/sec regardless of thread count or batch size. The new path raises the floor to 70+ docs/sec and gives meaningful tuning options (batch size, thread count). For bulk indexing, peak throughput hit 233 docs/sec on a single client thread with batch=64.

Manticore Search 27.1.5 is available now. If you're using auto-embeddings, upgrade and watch your INSERT throughput jump an order of magnitude.