NanoEuler: A GPT-2-Class Language Model Built from Scratch in C/CUDA

A developer named JustVugg released NanoEuler, a GPT-2-class language model built entirely from scratch in C and CUDA. No PyTorch, no autograd, no ML libraries. The forward and backward passes are hand-written and verified, and the whole training pipeline lives in a single GitHub repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model. RLHF/DPO is planned.

Architecture Details

The model is a decoder-only transformer with modern building blocks:

  • RMSNorm (pre-norm, no bias)
  • Rotary position embeddings (RoPE) applied to queries and keys
  • SwiGLU feed-forward: down(silu(gate(x)) * up(x))
  • Grouped-query attention (GQA): query heads share a smaller set of key/value heads
  • Multi-token prediction (MTP): K output heads predict the next K tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.
  • No biases anywhere.
  • Byte-level BPE tokenizer with GPT-2-style pretokenization (a single leading space attaches to the following word). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).

Each residual block is x = x + attn(rmsnorm(x)) followed by x = x + swiglu(rmsnorm(x)). The project name comes from the observation that a residual connection x = x + f(x) is exactly one forward-Euler step, with step size 1, for the ODE dx/dt = f(x).
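
The correspondence is easy to see in code. The sketch below is illustrative only, with hypothetical names (vector_field, euler_step, residual_step) and a toy function standing in for a real sublayer such as attn(rmsnorm(x)): a residual update is the general Euler update with the step size h pinned to 1.

```c
#include <assert.h>
#include <stddef.h>

/* A vector field f: writes f(x) into out. */
typedef void (*vector_field)(float *out, const float *x, size_t n);

/* One forward-Euler step for dx/dt = f(x): x <- x + h * f(x).
   scratch must hold n floats. */
static void euler_step(float *x, vector_field f, float h, float *scratch, size_t n) {
    assert(f != NULL);
    f(scratch, x, n);
    for (size_t i = 0; i < n; i++) x[i] += h * scratch[i];
}

/* A residual connection x <- x + f(x) is the same update with h = 1. */
static void residual_step(float *x, vector_field f, float *scratch, size_t n) {
    euler_step(x, f, 1.0f, scratch, n);
}

/* Toy stand-in for a sublayer (not a real attention or MLP): f(x) = -0.5 * x. */
static void toy_sublayer(float *out, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = -0.5f * x[i];
}
```

Seen this way, stacking layers amounts to integrating the ODE over as many unit time steps as there are residual updates, with a different f at each step.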

Configurations

The repo provides two configurations:

  • Small (CPU): dim=128, q/kv heads=4/2, layers=4, context=128, vocab=512, ~1.05M parameters. Trains in a few hours on 12 cores.
  • GPU pipeline: dim=768, q/kv heads=12/4, layers=16, context=512, vocab=4096, ~116M parameters. Trains on a single RTX 4070.

The head size is 64 (768/12), which fits the FlashAttention kernel.
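
The head arithmetic for both configurations can be sketched as follows. The struct and function names are hypothetical, and the mapping assumes that consecutive query heads share one key/value head, which is the usual GQA convention but is not confirmed against the NanoEuler source.

```c
#include <assert.h>

typedef struct {
    int dim;        /* model width */
    int n_q_heads;  /* number of query heads */
    int n_kv_heads; /* number of key/value heads */
} attn_config;

/* Size of one head: dim / n_q_heads (768 / 12 = 64 in the GPU config). */
static int head_dim(attn_config c) {
    assert(c.dim % c.n_q_heads == 0);
    return c.dim / c.n_q_heads;
}

/* Width of the K (or V) projection output: n_kv_heads * head_dim. */
static int kv_dim(attn_config c) {
    return c.n_kv_heads * head_dim(c);
}

/* Index of the K/V head that query head q reads from, assuming
   consecutive query heads form one group (an assumption of this sketch). */
static int kv_head_for(attn_config c, int q) {
    assert(c.n_q_heads % c.n_kv_heads == 0);
    assert(q >= 0 && q < c.n_q_heads);
    int group = c.n_q_heads / c.n_kv_heads;
    return q / group;
}
```

With the GPU configuration (12 query heads, 4 key/value heads), three query heads share each key/value head, so the K and V projections are 256 wide instead of 768.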

Verified Backward Pass

Every analytic gradient is compared against a central finite difference in double precision. Running make check prints the maximum relative error for each tensor:

tok      : max rel err 1.02e-04
qkvw     : max rel err 7.20e-07
gatew    : max rel err 6.86e-08
...
max relative error: 1.02e-04
>>> backward OK (error < 1e-2)

Every parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.

GPU Engine (CUDA)

The CUDA engine in cuda/nanoeuler_cuda.cu is a full from-scratch port — forward, backward, training, and inference on the GPU. Every kernel is validated on the device against a CPU reference, and the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6).

Kernels: matmul (delegated to cuBLAS with TF32 tensor cores), RMSNorm, RoPE, grouped-query attention with a hand-written FlashAttention (tiled, online softmax, no T×T matrix in memory), SwiGLU, softmax/cross-entropy, and AdamW. FlashAttention made the training step about 3× faster.

Build command (RTX 40-series = Ada = sm_89):

cd cuda
nvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas

Modes:

  • ./nanoeuler_cuda — run all kernel self-tests (GPU vs CPU)
  • ./nanoeuler_cuda g — full-model gradient check (GPU grads vs CPU)
  • ./nanoeuler_cuda t — pretrain from scratch, checkpoint to ../nanoeuler.bin every 5k steps
  • ./nanoeuler_cuda tr — resume pretraining from the latest checkpoint
  • ./nanoeuler_cuda i "It was" — autoregressive generation on GPU
  • ./nanoeuler_cuda s — supervised fine-tune on Alpaca, save ../nanoeuler_chat.bin
  • ./nanoeuler_cuda c — interactive chat with the fine-tuned model

Chat Pipeline

The chat pipeline has two stages. First, pretrain the ~116M base model on the books + web mix (./nanoeuler_cuda t). Then run supervised fine-tuning to turn it into an assistant: ./nanoeuler_cuda s loads the pretrained base, renders each Alpaca example with the standard instruction template, and trains with the loss masked so that only the response tokens contribute. The result is saved to nanoeuler_chat.bin; ./nanoeuler_cuda c then wraps each line you type in the same template and samples a reply.

After fine-tuning, the model answers in the right shape — it follows the instruction→response format, writes complete sentences, and stops on its own. The content, though, is shallow and often wrong: this is a small model trained on a single GPU, so it has little world knowledge to express. SFT teaches the model how to respond, not what it knows.

Data

Pretraining uses a real books + web mix:

  • Books: data/get_gutenberg.sh downloads ~95 public-domain Project Gutenberg classics (Austen, Dickens, Dostoevsky, Tolstoy, Melville, the complete Shakespeare, ...). Each book's license header/footer is stripped.
  • Web: data/get_web.sh pulls a slice of FineWeb-Edu (high-quality educational web text) from Hugging Face parquet files using the DuckDB CLI (a single static binary; no Python, no libraries).

Then concatenate them into the pretraining corpus:

sh data/get_gutenberg.sh                       # books  -> data/gutenberg.txt
sh data/get_web.sh                             # web    -> data/web.txt (~1 GB by default)
cat data/gutenberg.txt data/web.txt > data/pretrain.txt
sh data/get_alpaca.sh                          # instruction data for SFT -> data/alpaca.json

Roadmap

  • ✅ Hand-written byte-level BPE with GPT-2-style pretokenization.
  • ✅ From-scratch CUDA engine (cuBLAS + FlashAttention), validated by a full-model gradient check.
  • ✅ Pretraining on a books + web mix, with checkpoint/resume.
  • ✅ Supervised fine-tuning (Alpaca) with response-masked loss → a chat model.
  • ⏳ DPO (preference optimization) — the alignment stage, next to build.
  • ⏳ Scale the model and data (toward ~270M) and publish a trained checkpoint people can try.

Why It Matters

NanoEuler is a complete, understandable training pipeline for a decoder-only transformer, from tokenizer to fine-tuned chat model, with no external ML dependencies. It's a goldmine for developers who want to understand every piece of a modern language model — from gradient computation to CUDA kernel design. If you've ever felt that PyTorch's autograd hides too much, this is the antidote.