NanoEuler: Build GPT-2 from Scratch in C/CUDA
A developer named JustVugg released NanoEuler, a GPT-2-class language model built entirely from scratch in C and CUDA. No PyTorch, no autograd, no ML libraries. The forward and backward passes are hand-written and verified, and the whole training pipeline lives in a single GitHub repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model. RLHF/DPO is planned.
Architecture Details
The model is a decoder-only transformer with modern building blocks:
- RMSNorm (pre-norm, no bias)
- Rotary position embeddings (RoPE) applied to queries and keys
- SwiGLU feed-forward: down(silu(gate(x)) * up(x))
- Grouped-query attention (GQA): query heads share a smaller set of key/value heads
- Multi-token prediction (MTP): K output heads predict the next K tokens; the auxiliary heads improve the learned representation and enable speculative decoding. Generation uses head 0.
- No biases anywhere.
- Byte-level BPE tokenizer with GPT-2-style pretokenization (a single leading space attaches to the following word). Merges are learned on a sample of the corpus; the GPU model uses a 4096-token vocabulary (~3.4 bytes/token on English).
Each residual block is x = x + attn(rmsnorm(x)) followed by x = x + swiglu(rmsnorm(x)). The project name comes from the observation that a residual connection x = x + f(x) is exactly one forward-Euler step, with step size 1, for the ODE dx/dt = f(x).
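The Euler analogy can be written out in a few lines of C. The sketch below is illustrative only: euler_step, residual_update, and toy_sublayer are hypothetical names, and toy_sublayer is a stand-in for a real sublayer such as attn(rmsnorm(x)).

```c
#include <assert.h>
#include <stddef.h>

/* A vector field f: writes f(x) into out. Stands in for a sublayer. */
typedef void (*vector_field)(float *out, const float *x, size_t n);

/* Forward Euler for dx/dt = f(x): x <- x + h * f(x).
   scratch must be a separate buffer of length n that receives f(x). */
void euler_step(float *x, float *scratch, vector_field f, float h, size_t n) {
    assert(x != scratch);
    f(scratch, x, n);
    for (size_t i = 0; i < n; i++) x[i] += h * scratch[i];
}

/* A residual connection x = x + f(x) is the same update with h = 1. */
void residual_update(float *x, float *scratch, vector_field f, size_t n) {
    euler_step(x, scratch, f, 1.0f, n);
}

/* Toy sublayer for demonstration only: f(x) = -0.5 * x. */
void toy_sublayer(float *out, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = -0.5f * x[i];
}
```

With the toy sublayer, one residual update halves every component of x, which is the forward-Euler solution of dx/dt = -0.5x after a single step of size 1.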
Configurations
The repo provides two configurations:
- Small (CPU): dim=128, q/kv heads=4/2, layers=4, context=128, vocab=512, ~1.05M parameters. Trains in a few hours on 12 cores.
- GPU pipeline: dim=768, q/kv heads=12/4, layers=16, context=512, vocab=4096, ~116M parameters. Trains on a single RTX 4070.
In the GPU configuration the head size is 64 (768/12), which fits the FlashAttention kernel.
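The head arithmetic for grouped-query attention can be checked with a small C helper. This is a sketch under assumptions: the AttnShape struct and function names are hypothetical, and mapping consecutive query heads to the same key/value head is a common convention that the repo may or may not follow.

```c
#include <assert.h>

/* Hypothetical container for the attention shape of one configuration. */
typedef struct {
    int dim;        /* model width */
    int n_q_heads;  /* number of query heads */
    int n_kv_heads; /* number of shared key/value heads */
} AttnShape;

/* Size of one attention head: dim / n_q_heads. */
int head_dim(AttnShape s) {
    assert(s.n_q_heads > 0 && s.dim % s.n_q_heads == 0);
    return s.dim / s.n_q_heads;
}

/* Width of the key (or value) projection: one head_dim per KV head. */
int kv_dim(AttnShape s) {
    return head_dim(s) * s.n_kv_heads;
}

/* Which KV head a query head reads from, assuming consecutive query
   heads are grouped together (an assumed, conventional layout). */
int kv_head_for_query(AttnShape s, int q_head) {
    assert(s.n_kv_heads > 0 && s.n_q_heads % s.n_kv_heads == 0);
    assert(q_head >= 0 && q_head < s.n_q_heads);
    int group_size = s.n_q_heads / s.n_kv_heads;
    return q_head / group_size;
}
```

For the GPU configuration (dim=768, 12 query heads, 4 key/value heads), this gives a head size of 64, key and value projections of width 256, and three query heads per key/value head.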
Verified Backward Pass
Every analytic gradient is compared against a central finite difference in double precision. Running make check prints the maximum relative error for each tensor:
tok : max rel err 1.02e-04
qkvw : max rel err 7.20e-07
gatew : max rel err 6.86e-08
...
max relative error: 1.02e-04
>>> backward OK (error < 1e-2)
Every parameter tensor is checked, including the less obvious backward passes of RoPE, SwiGLU, GQA, and MTP.
GPU Engine (CUDA)
The CUDA engine in cuda/nanoeuler_cuda.cu is a full from-scratch port — forward, backward, training, and inference on the GPU. Every kernel is validated on the device against a CPU reference, and the whole model has a GPU gradient check (GPU grads vs CPU grads to ~1e-6).
Kernels: matmul (delegated to cuBLAS with TF32 tensor cores), RMSNorm, RoPE, grouped-query attention with a hand-written FlashAttention (tiled, online softmax, no T×T matrix in memory), SwiGLU, softmax/cross-entropy, and AdamW. FlashAttention made the training step about 3× faster.
Build command (RTX 40-series = Ada = sm_89):
cd cuda
nvcc -O3 -arch=sm_89 -Xcompiler -fno-tree-reassoc,-fno-tree-copy-prop nanoeuler_cuda.cu -o nanoeuler_cuda -lcublas
Modes:
- ./nanoeuler_cuda: run all kernel self-tests (GPU vs CPU)
- ./nanoeuler_cuda g: full-model gradient check (GPU grads vs CPU)
- ./nanoeuler_cuda t: pretrain from scratch, checkpoint to ../nanoeuler.bin every 5k steps
- ./nanoeuler_cuda tr: resume pretraining from the latest checkpoint
- ./nanoeuler_cuda i "It was": autoregressive generation on GPU
- ./nanoeuler_cuda s: supervised fine-tune on Alpaca, save ../nanoeuler_chat.bin
- ./nanoeuler_cuda c: interactive chat with the fine-tuned model
Chat Pipeline
The chat pipeline has two stages. First, pretrain the ~116M base model on the books + web mix (./nanoeuler_cuda t). Then supervised fine-tuning turns it into an assistant: ./nanoeuler_cuda s loads the pretrained base, renders each Alpaca example with the standard instruction template, and trains with the loss masked to the response tokens only. The result is saved to nanoeuler_chat.bin; ./nanoeuler_cuda c then wraps each line you type in the same template and samples a reply.
After fine-tuning, the model answers in the right shape — it follows the instruction→response format, writes complete sentences, and stops on its own. The content, though, is shallow and often wrong: this is a small model trained on a single GPU, so it has little world knowledge to express. SFT teaches the model how to respond, not what it knows.
Data
Pretraining uses a real books + web mix:
- Books: data/get_gutenberg.sh downloads ~95 public-domain Project Gutenberg classics (Austen, Dickens, Dostoevsky, Tolstoy, Melville, the complete Shakespeare, ...). Each book's license header/footer is stripped.
- Web: data/get_web.sh pulls a slice of FineWeb-Edu (high-quality educational web text) from Hugging Face parquet files using the DuckDB CLI (a single static binary; no Python, no libraries).
Then concatenate them into the pretraining corpus:
sh data/get_gutenberg.sh # books -> data/gutenberg.txt
sh data/get_web.sh # web -> data/web.txt (~1 GB by default)
cat data/gutenberg.txt data/web.txt > data/pretrain.txt
sh data/get_alpaca.sh # instruction data for SFT -> data/alpaca.json
Roadmap
- ✅ Hand-written byte-level BPE with GPT-2-style pretokenization.
- ✅ From-scratch CUDA engine (cuBLAS + FlashAttention), validated by a full-model gradient check.
- ✅ Pretraining on a books + web mix, with checkpoint/resume.
- ✅ Supervised fine-tuning (Alpaca) with response-masked loss → a chat model.
- ⏳ DPO (preference optimization) — the alignment stage, next to build.
- ⏳ Scale the model and data (toward ~270M) and publish a trained checkpoint people can try.
Why It Matters
NanoEuler is a complete, understandable training pipeline for a decoder-only transformer, from tokenizer to fine-tuned chat model, with no external ML dependencies. It's a goldmine for developers who want to understand every piece of a modern language model — from gradient computation to CUDA kernel design. If you've ever felt that PyTorch's autograd hides too much, this is the antidote.
