The Problem: GC Kills Throughput
Ian Cowley open-sourced three high-performance C# engines: Glacier.Polaris (DataFrame library), Glacier.Grep (text searcher), and Glacier.DocTree (semantic Markdown parser). All target zero-allocation hot paths. The core insight: every new on a reference type allocates on the managed heap, and enough of those allocations eventually trigger a garbage collection. For engines processing millions of rows per second, the resulting GC pauses can wreck throughput.
Level 1: Structs Over Classes
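The stack is thread-local, unwinds instantly when a method returns, and never involves the GC. Using struct (a value type) instead of class avoids a separate heap allocation per instance: a struct local lives on the stack, and an array of structs stores its elements inline in one contiguous block rather than as an array of references. A struct still ends up on the heap when it is boxed or stored as a field of a class. [StructLayout(LayoutKind.Sequential)], which is already the C# default for structs, keeps fields in their declared order. It does not by itself align data to CPU cache lines (typically 64 bytes), but small, contiguous structs let each cache line carry several elements, which reduces cache misses.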
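To make the layout point concrete, here is a minimal sketch. The Tick struct, its fields, and the helper names are hypothetical and not taken from the Glacier code, and the 64-byte cache line is an assumption about the target CPU.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical value type: two 8-byte fields, 16 bytes in total.
[StructLayout(LayoutKind.Sequential)]
public struct Tick
{
    public long Timestamp;
    public double Price;
}

public static class StructLayoutSketch
{
    // Size of one element as the runtime lays it out.
    public static int TickSize() => Unsafe.SizeOf<Tick>();

    // Number of elements that fit in one cache line (assumed to be 64 bytes).
    public static int TicksPerCacheLine(int cacheLineBytes = 64)
        => cacheLineBytes / Unsafe.SizeOf<Tick>();

    // Walks a contiguous block of Tick values by reference:
    // no per-element object, no copying, no allocation.
    public static double SumPrices(ReadOnlySpan<Tick> ticks)
    {
        double total = 0;
        foreach (ref readonly Tick t in ticks)
            total += t.Price;
        return total;
    }
}
```

A Tick[] passed to SumPrices is a single heap object holding every element inline, so the loop reads memory sequentially. With a class instead of a struct, the array would hold references to objects scattered across the heap.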
Level 2: Span and Memory for Zero-Allocation Slicing
Traditional Substring() allocates a new string, and Skip().Take() allocates iterator objects. Glacier engines use ReadOnlySpan<T> to slice buffers without copying. A Span<T> is a ref struct that can live only on the stack and holds a reference to the first element plus a length. Example:
// Zero-allocation slice
ReadOnlySpan<char> lineSpan = "Error: Connection Timeout".AsSpan();
ReadOnlySpan<char> messageSpan = lineSpan.Slice(7); // "Connection Timeout", no allocation
Because a Span<T> can't cross await boundaries, async pipelines pass Memory<T> instead. Once synchronous processing resumes, call .Span to get a cheap, allocation-free view.
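The hand-off between Memory<T> and Span<T> might look like the sketch below. The class and method names are hypothetical, and newline counting is only a stand-in for real per-chunk work. The async method holds a Memory<byte> across each await, then passes a span to a synchronous helper.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AsyncSliceSketch
{
    // The async method keeps a Memory<byte>, which is allowed to live across awaits.
    public static async Task<int> CountNewlinesAsync(Stream stream, Memory<byte> buffer)
    {
        int total = 0;
        int read;
        while ((read = await stream.ReadAsync(buffer)) > 0)
        {
            // The span exists only inside this synchronous expression,
            // so it never crosses an await boundary.
            total += CountNewlines(buffer.Span.Slice(0, read));
        }
        return total;
    }

    // Synchronous hot path: operates directly on the caller's buffer.
    private static int CountNewlines(ReadOnlySpan<byte> chunk)
    {
        int count = 0;
        foreach (byte b in chunk)
        {
            if (b == (byte)'\n') count++;
        }
        return count;
    }
}
```

The caller owns the buffer and can reuse it for every read, so the loop itself adds no per-chunk allocations.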
Level 3: ArrayPool for Temporary Buffers
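Allocating large arrays in a loop puts heavy pressure on the GC: arrays of 85,000 bytes or more go on the Large Object Heap, which is collected only with Gen 2. Glacier.Polaris uses System.Buffers.ArrayPool<T>.Shared to rent and return buffers:
int[] buffer = ArrayPool<int>.Shared.Rent(100000);
try
{
    // Rent may return a larger array than requested, so slice to the size you need.
    Span<int> workSpan = buffer.AsSpan(0, 100000);
    // process
}
finally
{
    ArrayPool<int>.Shared.Return(buffer);
}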
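In the steady state the GC sees no new allocation, because the pool hands back arrays it already holds. The pool allocates only when it has no suitable array available, such as on the first rent.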
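The rent-and-return pattern can be wrapped in a complete method, as in this sketch. PooledScratch and SumOfSquares are hypothetical names. It also shows why the slice matters: a rented array can be longer than requested and can contain stale data from a previous user.

```csharp
using System;
using System.Buffers;

public static class PooledScratch
{
    // Squares each input value into a rented scratch buffer, then sums the squares.
    public static long SumOfSquares(ReadOnlySpan<int> input)
    {
        int[] rented = ArrayPool<int>.Shared.Rent(input.Length);
        try
        {
            // Only the first input.Length slots are ours; the rest may hold stale data.
            Span<int> work = rented.AsSpan(0, input.Length);
            for (int i = 0; i < input.Length; i++)
                work[i] = input[i] * input[i];

            long total = 0;
            foreach (int v in work)
                total += v;
            return total;
        }
        finally
        {
            // Always return the array, even if processing throws.
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```

If the buffer holds sensitive data, Return accepts a clearArray argument that zeroes the array before it goes back to the pool.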
Level 4: SIMD via Vector256 and MemoryMarshal
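Once memory is flat and contiguous, you can use CPU vector instructions. .NET 7 and later provide a cross-platform Vector256<T> API, so no hardware-specific intrinsics are needed. Glacier.Polaris uses MemoryMarshal.GetReference to get a managed reference to the first element, without pinning, and feeds it into SIMD loops:
// Requires: using System.Runtime.CompilerServices;
//           using System.Runtime.InteropServices;
//           using System.Runtime.Intrinsics;
public static int SimdSum(ReadOnlySpan<int> data)
{
    int sum = 0;
    int i = 0;
    // Managed reference to the first element; nothing is pinned.
    ref int current = ref MemoryMarshal.GetReference(data);
    if (Vector256.IsHardwareAccelerated && data.Length >= Vector256<int>.Count)
    {
        Vector256<int> vSum = Vector256<int>.Zero;
        // Vector loop: one full vector of ints per iteration.
        for (; i <= data.Length - Vector256<int>.Count; i += Vector256<int>.Count)
        {
            Vector256<int> vData = Vector256.LoadUnsafe(ref current, (nuint)i);
            vSum += vData;
        }
        // Horizontal add of the vector lanes.
        sum += Vector256.Sum(vSum);
    }
    // Scalar tail: elements that did not fill a whole vector.
    for (; i < data.Length; i++)
        sum += Unsafe.Add(ref current, i);
    return sum;
}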
On hardware with 256-bit vectors, each vector add processes eight 32-bit integers at once.
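On hardware without 256-bit vector support, Vector256.IsHardwareAccelerated is false and the method above runs only its scalar loop. One alternative, swapped in here for comparison and not something the Glacier code is known to use, is System.Numerics.Vector<T>, which takes whatever vector width the hardware offers. A minimal sketch with a hypothetical class name:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class PortableSimdSketch
{
    public static int Sum(ReadOnlySpan<int> data)
    {
        int sum = 0;
        int i = 0;
        if (Vector.IsHardwareAccelerated && data.Length >= Vector<int>.Count)
        {
            // Reinterpret the ints as whole vectors; no copy is made and
            // any leftover elements are simply not part of the cast span.
            ReadOnlySpan<Vector<int>> vectors = MemoryMarshal.Cast<int, Vector<int>>(data);
            Vector<int> acc = Vector<int>.Zero;
            foreach (Vector<int> v in vectors)
                acc += v;
            sum = Vector.Sum(acc);
            i = vectors.Length * Vector<int>.Count;
        }
        // Scalar tail for the elements that did not fill a vector.
        for (; i < data.Length; i++)
            sum += data[i];
        return sum;
    }
}
```

The trade-off is that Vector<T> has a width fixed at run time, so code that depends on exactly eight lanes still needs Vector256<T>.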
Level 5: Benchmarks
Cowley shared BenchmarkDotNet results:
| Method | Mean | Allocated |
|---|---|---|
| Standard Substring | 18.45 ns | 32 B |
| Glacier Span Slice | 0.02 ns | 0 B |
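The 0 B in the Allocated column is the goal. Read the 0.02 ns mean as "too fast to measure" rather than as a precise timing, since it is below what BenchmarkDotNet can resolve reliably.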
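For a quick check of the allocation column without a full benchmark run, the runtime's per-thread allocation counter can be sampled around a loop. This is a rough sketch with hypothetical names, not a replacement for BenchmarkDotNet and not the harness Cowley used. The checksum parameter exists only so the loop body is not optimized away.

```csharp
using System;

public static class AllocationProbe
{
    // Bytes allocated on this thread while calling Substring in a loop.
    public static long BytesForSubstring(string line, int start, int iterations, out int checksum)
    {
        checksum = 0;
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
            checksum += line.Substring(start).Length;
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // Bytes allocated on this thread while slicing a span in a loop.
    public static long BytesForSlice(string line, int start, int iterations, out int checksum)
    {
        checksum = 0;
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
            checksum += line.AsSpan().Slice(start).Length;
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

Comparing the two return values for the same input shows the Substring loop allocating a new string per iteration, while the slice loop should add nothing to the counter.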
Conclusion
Identify hot paths where data flows by the gigabyte. Replace class with struct, slice with Span, rent from ArrayPool, and use SIMD via Vector256. The Glacier repositories on GitHub demonstrate these patterns in production-grade code. Study them, and you can build engines that push .NET to its limits.



