The Memory Wall Is Dead
PhantaField's PFG-1 "Sophon" is a monolithic-3D AI ASIC that packs 330 GB of on-die DRAM and delivers 4,200 TFLOPS FP8 (2,100 TFLOPS BF16) on a 750 mm² die. It eliminates HBM entirely and reaches 14,438 tokens/s FP8 decode on an 80B-parameter model at 373 W, or 174× the tokens per watt of an NVIDIA Rubin (R200) at low batch sizes.
Architecture: 32 Tiers of TMD CMOS
Sophon stacks 32 logic tiers (MAC arrays) and 32 memory tiers (2T0C DRAM) on a 28 nm Si CMOS base. Each tier spans the full 750 mm² and is built from 2D transition-metal dichalcogenide (TMD) transistors (MoS₂ n-FETs, WSe₂ p-FETs) processed at ≤ 450 °C. The full stack rises ~22 µm above the Si die. Monolithic inter-tier vias (MIVs) at 90 nm pitch allow up to 1.23×10⁸ connections/mm², of which only ~5.5×10⁵/mm² (under 0.5%) are used.
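The quoted via density follows from the pitch if the vias sit on a square grid, one per pitch-by-pitch cell. The square grid is an assumption here, not something the text specifies, and the names below are illustrative. A quick Python check:

```python
# Sanity check of MIV density from pitch.
# Assumption: square grid, one via per (pitch x pitch) cell.
PITCH_NM = 90.0      # MIV pitch (from the text)
NM_PER_MM = 1e6
USED_PER_MM2 = 5.5e5 # vias actually used per mm^2 (from the text)


def miv_density_per_mm2(pitch_nm: float) -> float:
    """Via sites per mm^2 for a square grid at the given pitch."""
    vias_per_mm = NM_PER_MM / pitch_nm
    return vias_per_mm ** 2


available = miv_density_per_mm2(PITCH_NM)
utilization = USED_PER_MM2 / available  # fraction of via sites in use

print(f"available: {available:.3g} per mm^2")
print(f"utilization: {utilization:.2%}")
```

Under that assumption the available density comes out near 1.23×10⁸ per mm², and the used vias occupy well under 1% of the available sites.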
2T0C Gain-Cell DRAM: No Capacitor Needed
The memory cell uses two TMD transistors and no capacitor (2T0C). The storage node relies on the gate capacitance of the read transistor (~2.5 fF) plus junction capacitance (~0.5 fF), about 3 fF in total. TMD off-current density is 1 fA/µm (0.5 fA per cell), enabling retention of 1.8 seconds at 25 °C. Sophon refreshes every 1.0 second, well inside that window, at a cost of 0.08 W. At 60 °C, retention drops to 159 ms and refresh must run more often, but refresh power stays under 4 W.
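These figures can be cross-checked with the usual gain-cell estimate, retention = C × ΔV / I_leak. The tolerable voltage droop ΔV is not stated in the text, so the sketch below infers it from the 1.8 s figure. It also assumes retention halves for every 10 °C of heating, a common rule of thumb that happens to reproduce the 159 ms number but is not a model the text itself gives.

```python
# Back-of-envelope retention check for the 2T0C cell: t = C * dV / I_leak.
# The droop dV and the halving-per-10-C rule are assumptions, not source data.
C_GATE_F = 2.5e-15    # read-transistor gate capacitance (from the text)
C_JUNC_F = 0.5e-15    # junction capacitance (from the text)
I_LEAK_A = 0.5e-15    # off-current per cell (from the text)
T_RET_25C_S = 1.8     # retention at 25 C (from the text)

c_node = C_GATE_F + C_JUNC_F                        # ~3 fF storage node
implied_droop_v = T_RET_25C_S * I_LEAK_A / c_node   # droop implied by 1.8 s


def retention_s(temp_c: float, halving_step_c: float = 10.0) -> float:
    """Retention if it halves every `halving_step_c` degrees above 25 C."""
    return T_RET_25C_S * 0.5 ** ((temp_c - 25.0) / halving_step_c)


print(f"implied droop: {implied_droop_v:.2f} V")
print(f"retention at 60 C: {retention_s(60.0) * 1e3:.0f} ms")
```

With these inputs the implied droop is about 0.3 V, and the 60 °C estimate lands on roughly 159 ms, matching the quoted figure.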
Compute: Pure Digital CIM
Each of the 131,072 tiles contains a 256×256 weight subarray, a binary sense amplifier, and an 8-level adder tree. Activations are broadcast bit-serially at 500 MHz, taking 16 cycles for BF16 and 8 for FP8. Per-tile energy is 0.620 pJ/MAC for BF16 forward, 0.940 pJ/MAC for forward+backward, and 0.310 pJ/MAC for FP8 inference. Efficiency averages 3.72 TFLOPS/W over BF16 training.
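The tile parameters relate to one another in a simple way, sketched below. The sketch assumes a binary adder tree (so 256 inputs reduce in log2(256) levels) and one activation per bit-serial pass. Both are readings of the text rather than confirmed design details, and the names are illustrative.

```python
import math

CLOCK_HZ = 500e6                  # bit-serial broadcast clock (from the text)
SUBARRAY_ROWS = 256               # one side of the 256x256 subarray
CYCLES = {"BF16": 16, "FP8": 8}   # cycles per activation (from the text)
PJ_PER_MAC = {                    # per-tile energy (from the text)
    "bf16_forward": 0.620,
    "bf16_forward_backward": 0.940,
    "fp8_inference": 0.310,
}

# Assumption: binary adder tree, so 256 partial sums need log2(256) levels.
adder_levels = int(math.log2(SUBARRAY_ROWS))


def activations_per_second(fmt: str) -> float:
    """Activations broadcast per second at one bit per clock cycle."""
    return CLOCK_HZ / CYCLES[fmt]


# FP8 uses half the cycles of BF16, and the quoted energy halves with it.
fp8_vs_bf16_energy = PJ_PER_MAC["fp8_inference"] / PJ_PER_MAC["bf16_forward"]

print(f"adder levels: {adder_levels}")
for fmt in CYCLES:
    print(f"{fmt}: {activations_per_second(fmt) / 1e6:.2f} M activations/s")
print(f"FP8 / BF16 forward energy: {fp8_vs_bf16_energy:.2f}")
```

The halved FP8 energy lines up with the halved cycle count, which is what a bit-serial design would be expected to show.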
Performance: 14,438 Tokens/s for 80B Models
At 373 W, Sophon serves an 80B model at 7,219 tokens/s BF16 decode or 14,438 tokens/s FP8 decode. Training throughput is 2,406 tokens/s at 564 W average. With INT4 speculative decoding in FP8 mode, effective throughput reaches 72,188 tokens/s. Sophon's aggregate weight bandwidth is 4.2 PB/s, roughly 191–214× the 22 TB/s of HBM4 bandwidth available to Rubin.
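The throughput figures are tied together by simple ratios, which the sketch below recomputes from the quoted numbers only. Pairing the 3.72 TFLOPS/W training-average efficiency with the 564 W training power is an assumption made here for the cross-check, and the variable names are illustrative.

```python
# Cross-check of the quoted throughput figures (all inputs from the text).
BF16_DECODE_TPS = 7_219
FP8_DECODE_TPS = 14_438
TRAIN_TPS = 2_406
SPEC_DECODE_TPS = 72_188
DECODE_POWER_W = 373
TRAIN_POWER_W = 564
TRAIN_TFLOPS_PER_W = 3.72
SOPHON_BW_BPS = 4.2e15     # 4.2 PB/s
RUBIN_HBM_BW_BPS = 22e12   # 22 TB/s

fp8_over_bf16 = FP8_DECODE_TPS / BF16_DECODE_TPS     # FP8 speedup over BF16
train_over_bf16 = TRAIN_TPS / BF16_DECODE_TPS        # training vs. decode rate
spec_over_fp8 = SPEC_DECODE_TPS / FP8_DECODE_TPS     # speculative-decoding gain
tokens_per_joule = FP8_DECODE_TPS / DECODE_POWER_W   # tokens/s per watt
bandwidth_ratio = SOPHON_BW_BPS / RUBIN_HBM_BW_BPS
# Assumption: the training-average efficiency applies at the 564 W average.
train_tflops = TRAIN_TFLOPS_PER_W * TRAIN_POWER_W

print(f"FP8 / BF16 decode:     {fp8_over_bf16:.2f}x")
print(f"training / BF16 decode: {train_over_bf16:.3f}")
print(f"speculative / FP8:      {spec_over_fp8:.2f}x")
print(f"FP8 tokens per joule:   {tokens_per_joule:.1f}")
print(f"bandwidth ratio:        {bandwidth_ratio:.0f}x")
print(f"training TFLOPS:        {train_tflops:.0f}")
```

FP8 decode is twice the BF16 rate, training runs at about a third of it, and speculative decoding adds about 5×. The bandwidth ratio lands at the low end of the quoted 191–214× range, and the efficiency-times-power product comes out close to the 2,100 TFLOPS BF16 headline figure.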
Economics: No HBM, Lower BOM
Morgan Stanley estimates an NVIDIA VR200 NVL72 rack at ~$7.8M, with HBM memory alone at $2.0M (25.7% of rack cost). Sophon's BOM is $8,358 per die, a ~9.9× cost reduction vs. Rubin. The die loads weights once from NVMe at boot and retains them at ~3 W idle power.
Why It Matters
Sophon demonstrates that monolithic-3D integration with TMD transistors can overcome the HBM bandwidth wall. For inference serving at low batch sizes, the bottleneck is weight bandwidth, not compute FLOPS. Sophon's architecture makes every MAC its own memory controller, eliminating HBM's shared-bus contention. If manufacturable at scale, this could reshape AI hardware economics.

