Sophon PFG-1: 330GB On-Die DRAM ASIC Eliminates HBM, Delivers 14,438 Tokens/s

3 min read · Jun 29, 2026

The Memory Wall Is Dead

PhantaField's PFG-1 "Sophon" is a monolithic-3D AI ASIC that packs 330 GB of on-die DRAM and delivers 4,200 TFLOPS FP8 (2,100 TFLOPS BF16) in a 750 mm² die. It eliminates HBM entirely and decodes an 80B model at 14,438 tokens/s in FP8 while drawing 373 W, a claimed 174× better tokens-per-watt than an NVIDIA Rubin (R200) at low batch.

Architecture: 32 Tiers of TMD CMOS

Sophon stacks 32 logic tiers (MAC arrays) and 32 memory tiers (2T0C DRAM) on a 28 nm Si CMOS base. Each 750 mm² tier is fabricated at ≤ 450 °C using 2D transition-metal dichalcogenide (TMD) transistors (MoS₂ n-FET, WSe₂ p-FET). The full stack adds ~22 µm above the Si die. Monolithic inter-tier vias (MIVs) at 90 nm pitch allow up to 1.23×10⁸ connections/mm², though only ~5.5×10⁵/mm² are used.
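The via-density figure follows directly from the pitch. A minimal sketch, assuming the vias sit on a square grid (the article does not describe the layout), reproduces the 1.23×10⁸/mm² ceiling and shows that the quoted usage is roughly 0.45% of it:

```python
# Back-of-envelope check of the inter-tier via figures quoted in the article.
# Assumption (not from the article): vias are placed on a square grid.
MIV_PITCH_NM = 90.0      # via pitch, from the article
USED_PER_MM2 = 5.5e5     # vias actually used per mm^2, from the article
NM_PER_MM = 1e6


def max_via_density_per_mm2(pitch_nm: float) -> float:
    """Maximum vias per mm^2 on a square grid with the given pitch."""
    vias_per_mm = NM_PER_MM / pitch_nm
    return vias_per_mm ** 2


density = max_via_density_per_mm2(MIV_PITCH_NM)
utilization = USED_PER_MM2 / density

print(f"max density: {density:.3g} vias/mm^2")
print(f"utilization: {utilization:.2%}")
```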

2T0C Gain-Cell DRAM: No Capacitor Needed

The memory cell uses two TMD transistors and no capacitor (2T0C). The storage node relies on the gate capacitance of the read transistor (~2.5 fF) plus junction capacitance (~0.5 fF). TMD off-current density is 1 fA/µm (0.5 fA per cell), enabling retention of 1.8 seconds at 25 °C. Sophon refreshes every 1.0 second at 0.08 W. At 60 °C, retention drops to 159 ms, but refresh power stays under 4 W.
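The retention figure can be sanity-checked with the charge relation Q = C·ΔV = I·t. The sketch below assumes constant leakage and a linear discharge, which is a simplification; the tolerable voltage droop and the 60 °C leakage are inferred values, not numbers PhantaField has published. It also assumes the energy of one refresh pass does not change with temperature.

```python
# Rough consistency check of the 2T0C retention and refresh numbers.
# Assumptions (not from the article): constant leakage current, linear
# discharge, and a temperature-independent energy per refresh pass.
C_STORAGE_F = 2.5e-15 + 0.5e-15   # gate + junction capacitance, from the article
I_LEAK_25C_A = 0.5e-15            # per-cell leakage at 25 C, from the article
RETENTION_25C_S = 1.8
RETENTION_60C_S = 0.159
REFRESH_INTERVAL_25C_S = 1.0
REFRESH_POWER_25C_W = 0.08


def implied_droop_v(c_f: float, i_a: float, t_s: float) -> float:
    """Voltage lost by the storage node after t_s seconds of leakage."""
    return i_a * t_s / c_f


def implied_leakage_a(c_f: float, droop_v: float, t_s: float) -> float:
    """Leakage current that loses droop_v in t_s seconds."""
    return c_f * droop_v / t_s


droop_v = implied_droop_v(C_STORAGE_F, I_LEAK_25C_A, RETENTION_25C_S)
leak_60c_a = implied_leakage_a(C_STORAGE_F, droop_v, RETENTION_60C_S)

# Energy of one full refresh pass, then power if the whole array is
# refreshed once per 60 C retention period.
energy_per_pass_j = REFRESH_POWER_25C_W * REFRESH_INTERVAL_25C_S
refresh_power_60c_w = energy_per_pass_j / RETENTION_60C_S

print(f"implied droop budget: {droop_v:.2f} V")
print(f"implied 60 C leakage: {leak_60c_a * 1e15:.2f} fA per cell")
print(f"refresh power at 60 C: {refresh_power_60c_w:.2f} W")
```

Under these assumptions the numbers imply a droop budget of about 0.3 V and a 60 °C refresh cost of about 0.5 W, comfortably inside the quoted 4 W ceiling.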

Compute: Pure Digital CIM

Each of the 131,072 tiles contains a 256×256 weight subarray, a binary sense amplifier, and an 8-level adder tree. Activations are broadcast bit-serially at 500 MHz (16 cycles for BF16, 8 for FP8). Per-tile energy is 0.620 pJ/MAC for BF16 forward, 0.940 pJ/MAC for forward+backward, and 0.310 pJ/MAC for FP8 inference. Efficiency is quoted as 3.72 TFLOPS/W, which matches peak BF16 throughput (2,100 TFLOPS) divided by the 564 W average training power.
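The bit-serial figures are internally consistent: halving the cycle count from BF16 to FP8 halves both the broadcast latency and the energy per MAC. A small sketch using only the numbers above (the per-cycle energy is a derived quantity, not one the article states):

```python
# Relating bit-serial cycle counts to latency and energy per MAC.
CLOCK_HZ = 500e6
CYCLES = {"bf16": 16, "fp8": 8}                   # from the article
ENERGY_PJ_PER_MAC = {"bf16": 0.620, "fp8": 0.310}  # forward pass, from the article


def broadcast_latency_ns(fmt: str) -> float:
    """Time to stream one activation bit-serially, in nanoseconds."""
    return CYCLES[fmt] / CLOCK_HZ * 1e9


def energy_per_cycle_pj(fmt: str) -> float:
    """Derived energy per bit-serial cycle for one MAC, in picojoules."""
    return ENERGY_PJ_PER_MAC[fmt] / CYCLES[fmt]


for fmt in CYCLES:
    print(
        f"{fmt}: {broadcast_latency_ns(fmt):.0f} ns per activation, "
        f"{energy_per_cycle_pj(fmt):.5f} pJ per cycle"
    )
```

Both formats come out at the same energy per cycle, which suggests the quoted FP8 saving comes purely from streaming fewer bits.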

Performance: 14,438 Tokens/s for 80B Models

At 373 W, Sophon serves an 80B model at 7,219 tokens/s BF16 decode or 14,438 tokens/s FP8 decode. Training throughput is 2,406 tokens/s at 564 W average. With INT4 speculative decoding in FP8 mode, effective throughput reaches 72,188 tokens/s, about 5× the plain FP8 rate. Sophon's aggregate weight bandwidth is 4.2 PB/s, or ~191–214× the 22 TB/s of HBM4 bandwidth on Rubin.
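These throughput and power numbers can be cross-checked against the per-MAC energies from the compute section. The sketch assumes one MAC per parameter per decoded token, a common approximation for dense transformers that ignores attention over the KV cache, and it treats 4.2 PB/s as the figure being compared with Rubin's 22 TB/s.

```python
# Cross-check of decode power, tokens-per-watt, and bandwidth ratio.
# Assumption (not from the article): one MAC per parameter per token.
PARAMS = 80e9
PJ_PER_MAC = {"bf16": 0.620, "fp8": 0.310}       # from the compute section
DECODE_TOKENS_PER_S = {"bf16": 7_219, "fp8": 14_438}
CHIP_POWER_W = 373
SPECULATIVE_TOKENS_PER_S = 72_188
SOPHON_BW_BYTES_PER_S = 4.2e15
RUBIN_HBM4_BW_BYTES_PER_S = 22e12


def mac_power_w(fmt: str) -> float:
    """Power spent in the MAC arrays alone at the quoted decode rate."""
    joules_per_token = PARAMS * PJ_PER_MAC[fmt] * 1e-12
    return joules_per_token * DECODE_TOKENS_PER_S[fmt]


fp8_mac_power = mac_power_w("fp8")
bf16_mac_power = mac_power_w("bf16")
tokens_per_watt = DECODE_TOKENS_PER_S["fp8"] / CHIP_POWER_W
bandwidth_ratio = SOPHON_BW_BYTES_PER_S / RUBIN_HBM4_BW_BYTES_PER_S
speculative_gain = SPECULATIVE_TOKENS_PER_S / DECODE_TOKENS_PER_S["fp8"]

print(f"MAC power: {fp8_mac_power:.0f} W FP8, {bf16_mac_power:.0f} W BF16")
print(f"share of 373 W: {fp8_mac_power / CHIP_POWER_W:.0%}")
print(f"FP8 tokens per watt: {tokens_per_watt:.1f}")
print(f"bandwidth ratio: {bandwidth_ratio:.0f}x")
print(f"speculative decoding gain: {speculative_gain:.2f}x")
```

Under that assumption the MAC arrays account for about 358 W of the 373 W budget in both precisions, and the bandwidth ratio lands at the low end of the quoted 191–214× range.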

Economics: No HBM, Lower BOM

Morgan Stanley estimates an NVIDIA VR200 NVL72 rack at ~$7.8M, with HBM memory alone at $2.0M (25.7% of rack cost). Sophon's BOM is $8,358 per die—a ~9.9× cost reduction vs Rubin. The die loads weights once from NVMe at boot and retains them with ~3 W idle power.
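The cost figures reduce to two ratios. In the sketch below, the "implied baseline" is simply what the 9.9× claim works out to against the quoted BOM; it is not a price the article states.

```python
# Ratios behind the economics paragraph, using the rounded figures as quoted.
RACK_COST_USD = 7.8e6
HBM_COST_USD = 2.0e6
SOPHON_BOM_USD = 8_358
CLAIMED_REDUCTION = 9.9

hbm_share = HBM_COST_USD / RACK_COST_USD

# Derived, not quoted: the per-device cost the 9.9x ratio would imply.
implied_baseline_usd = SOPHON_BOM_USD * CLAIMED_REDUCTION

print(f"HBM share of rack cost: {hbm_share:.1%}")
print(f"implied comparison cost: ${implied_baseline_usd:,.0f}")
```

The rounded inputs give an HBM share of about 25.6%, so the 25.7% in the text presumably comes from unrounded estimates.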

Why It Matters

Sophon demonstrates that monolithic-3D with TMD transistors can overcome the HBM bandwidth wall. For inference serving at low batch, weight bandwidth—not compute FLOPS—is the bottleneck. Sophon's architecture makes every MAC its own memory controller, eliminating HBM's shared bus contention. If manufacturable at scale, this could reshape AI hardware economics.
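To see why weight bandwidth dominates at low batch, consider a simple bandwidth-bound model of decoding: at batch size 1, every weight must be read once per token, so tokens/s cannot exceed bandwidth divided by model size in bytes. This is a rough upper bound that ignores KV-cache traffic and compute limits, and the function name is illustrative.

```python
# Bandwidth-bound ceiling on batch-1 decode throughput.
# Assumption: each weight is read exactly once per generated token.
PARAMS = 80e9
FP8_BYTES_PER_PARAM = 1
RUBIN_HBM4_BW_BYTES_PER_S = 22e12   # from the article
SOPHON_BW_BYTES_PER_S = 4.2e15      # from the article
SOPHON_QUOTED_TOKENS_PER_S = 14_438


def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when decode is limited by weight reads."""
    model_bytes = PARAMS * FP8_BYTES_PER_PARAM
    return bandwidth_bytes_per_s / model_bytes


hbm_ceiling = decode_ceiling_tokens_per_s(RUBIN_HBM4_BW_BYTES_PER_S)
sophon_ceiling = decode_ceiling_tokens_per_s(SOPHON_BW_BYTES_PER_S)

print(f"HBM4-bound ceiling: {hbm_ceiling:.0f} tokens/s")
print(f"Sophon bandwidth ceiling: {sophon_ceiling:.0f} tokens/s")
print(f"Sophon quoted rate: {SOPHON_QUOTED_TOKENS_PER_S} tokens/s")
```

In this model a 22 TB/s memory system tops out near 275 tokens/s per stream for an 80B FP8 model, while Sophon's quoted rate sits well below its own bandwidth ceiling, meaning compute rather than memory would be the limit.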

Editor's Take

I've been skeptical of monolithic-3D claims since the 2010s, but PhantaField's numbers are startlingly specific—down to per-cell leakage and refresh power at 60°C. The 2T0C DRAM retention of 1.8 seconds is just enough for training workloads without constant refresh. If they can actually fab 32 tiers of TMD CMOS at 750 mm², this is the most exciting AI hardware I've seen since the TPU. I'd love to see independent benchmarks, but the physics checks out.

— DevDigest Editorial

Key Takeaways

• Sophon eliminates HBM, so developers won't need to optimize for memory bandwidth; compute becomes the only bottleneck.
• The 330 GB on-die capacity fits an 80B BF16 model with optimizer state, enabling single-die training without model parallelism.
• With 14,438 tokens/s FP8 decode, inference serving costs drop dramatically, potentially 10× cheaper per token than current HBM-based GPUs.
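The capacity claim is easy to budget. The article does not say which optimizer or state precision PhantaField assumes, so the sketch below only shows how much room is left after the BF16 weights. The 12 bytes per parameter used for comparison (FP32 master weights plus two FP32 Adam moments) is a common mixed-precision convention, not a figure from the article.

```python
# Memory budget for single-die training of an 80B model.
# Assumption: 1 GB = 1e9 bytes; the Adam figure is for comparison only.
GB = 1e9
CAPACITY_BYTES = 330 * GB
PARAMS = 80e9
BF16_BYTES_PER_PARAM = 2
ADAM_FP32_STATE_BYTES_PER_PARAM = 12   # hypothetical baseline, not from the article

weight_bytes = PARAMS * BF16_BYTES_PER_PARAM
remaining_bytes = CAPACITY_BYTES - weight_bytes
budget_bytes_per_param = remaining_bytes / PARAMS

adam_state_bytes = PARAMS * ADAM_FP32_STATE_BYTES_PER_PARAM
adam_fits = adam_state_bytes <= remaining_bytes

print(f"weights: {weight_bytes / GB:.0f} GB")
print(f"left for gradients and optimizer state: {remaining_bytes / GB:.0f} GB")
print(f"budget per parameter: {budget_bytes_per_param:.3f} bytes")
print(f"FP32 Adam state fits: {adam_fits}")
```

With about 2.1 bytes per parameter left over, the claim would hold only for a compact optimizer state, for example low-precision or momentum-only schemes, rather than conventional FP32 Adam.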

Why It Matters

For developers training or serving large language models, HBM bandwidth is the primary bottleneck. Sophon's on-die DRAM delivers 191× more weight bandwidth than HBM4, enabling 174× better tokens-per-watt. This could make AI inference and training far cheaper and more energy-efficient, especially for single-batch serving.

#hardware #DRAM #inference #training #AI accelerator
