AMD Strix Halo RDMA Cluster Setup Guide

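This guide walks through building a two-node AMD Strix Halo cluster connected via Intel E810 NICs (RoCE v2) for distributed vLLM inference. Tensor Parallelism splits each layer's weights across the nodes, so the nodes must exchange partial results after every layer and inter-node latency directly limits throughput. This works best with latency in the single-digit microseconds; RoCE v2 delivers about 5µs in this setup.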

TL;DR (Quick Start)

On both nodes:

  • Install or update Fedora 43 and install the E810 NICs. Check firmware with ethtool -i enp194s0np0 (substitute your interface name).
  • BIOS: Set iGPU memory allocation to 512MB. Kernel params: iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856.
  • Configure passwordless SSH.
  • Assign static IPs (192.168.100.1 & .2), MTU 9000, trust interface in firewall.
  • Run ./refresh_toolbox.sh (installs container with RDMA support and custom librccl.so patch).
  • Run start-vllm-cluster, select "2. Start Ray Cluster", then "4. Launch VLLM Serve". Export HF_TOKEN for gated models.

Concepts & Architecture

  • vLLM: High-performance inference engine. Tensor Parallelism (TP) splits each layer's weight matrices across GPUs, so every GPU computes part of every layer.
  • Ray: Orchestrates cluster, manages workers.
  • RCCL: AMD's collective communication library. Handles data plane: fast tensor sync between GPUs. TP=2 exchanges partial results after every layer, thousands of times per second.
  • RoCE v2: RDMA over Converged Ethernet. Writes data directly from one node's memory to another, bypassing CPU/OS kernel.
    • Without RDMA: ~70-100µs latency (TCP/IP).
    • With RDMA: ~5µs latency.
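To see why per-message latency matters, here is a rough back-of-the-envelope estimate. It assumes a 32-layer model (the layer count of Llama 3.1 8B), two all-reduce operations per layer per generated token (a common pattern for tensor-parallel transformers, not something measured in this guide), and that each all-reduce costs roughly one network latency. Bandwidth and compute time are ignored, so treat the result as a lower bound on communication wait only.

```shell
# Rough latency floor added by TP communication per generated token.
# Assumptions (illustrative, not measured here): 32 layers, 2 all-reduces per layer.
comm_overhead_us() {
    layers=$1; allreduces_per_layer=$2; latency_us=$3
    echo $(( layers * allreduces_per_layer * latency_us ))
}

echo "TCP  (~70us): $(comm_overhead_us 32 2 70) us per token"
echo "RDMA (~5us):  $(comm_overhead_us 32 2 5) us per token"
```

Under these assumptions TCP adds about 4.5 ms of waiting per token, versus about 0.3 ms with RDMA.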

Hardware Prerequisites

  • Nodes: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo", 128GB Unified Memory.
  • Network Cards: Intel Ethernet Controller E810-CQDA1 (100GbE QSFP28).
  • Connection: Direct Attach Copper (DAC) cable (e.g., QSFPTEK 100G QSFP28 DAC). No switch needed for 2 nodes.
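  • PCIe Note: The Framework mainboard's PCIe slot is physically x4, so the NIC needs a riser (e.g., CY PCI-E Express 4x to 16x Extender). The riser itself does not cost performance, but the x4 link caps throughput at roughly 50Gbps, about half the NIC's 100GbE line rate. Latency stays around 5µs.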

Host Configuration (Fedora)

Tested on Fedora 43 with kernels 6.18.5 and 6.18.6.

4.1 Install Packages

sudo dnf install rdma-core libibverbs-utils perftest
  • rdma-core: Userspace RDMA components.
  • libibverbs-utils: Query RDMA devices (ibv_devinfo).
  • perftest: Benchmarks (ib_write_bw, ib_send_lat).

4.2 Check Native Firmware

ethtool -i enp194s0np0

Example output:

firmware-version: 4.91 0x800214b5 1.3909.0

If your firmware is older than this, update it with the Intel NVM Update Tool.
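The firmware check can be scripted. The sketch below pulls the leading version number out of the ethtool -i output and compares it against a reference. Using 4.91 as the reference is an assumption based on the example output above, and the comparison relies on GNU sort -V, which Fedora ships.

```shell
# Extract the leading firmware version (e.g. "4.91") from `ethtool -i` output on stdin.
fw_version() {
    awk -F': ' '/^firmware-version:/ { split($2, parts, " "); print parts[1] }'
}

# Succeed if version $1 is at least version $2 (version-aware compare via GNU sort -V).
fw_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage on a node (interface name taken from this guide):
#   current=$(ethtool -i enp194s0np0 | fw_version)
#   fw_at_least "$current" 4.91 || echo "firmware $current is older than 4.91"
```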

4.3 Network Configuration

Node 1 (Head - 192.168.100.1):

sudo ip link set enp194s0np0 up
sudo ip addr add 192.168.100.1/30 dev enp194s0np0
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"

Node 2 (Worker - 192.168.100.2):

sudo ip link set enp194s0np0 up
sudo ip addr add 192.168.100.2/30 dev enp194s0np0
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"

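Verify link: rdma link should show state ACTIVE physical_state LINK_UP. Note that addresses added with ip addr add do not survive a reboot, and the nmcli commands assume a connection profile named rdma0 already exists for the interface.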
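A quick way to confirm that MTU 9000 works end to end is to send a ping that only fits in a jumbo frame while prohibiting fragmentation. The sketch below computes the payload size; -M do is the Linux iputils ping option that sets "don't fragment".

```shell
# Largest ICMP payload that fits in one frame:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
jumbo_payload() {
    echo $(( $1 - 28 ))
}

# Usage from Node 1 (3 pings, fragmentation prohibited):
#   ping -M do -c 3 -s "$(jumbo_payload 9000)" 192.168.100.2
```

If the pings fail with a "message too long" error, one side is still using a smaller MTU.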

4.4 BIOS & Kernel Configuration

BIOS: Set iGPU Memory Allocation to 512MB.

Add these kernel parameters to GRUB_CMDLINE_LINUX in /etc/default/grub:

iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
  • iommu=pt: IOMMU pass-through for NIC and iGPU.
  • pci=realloc: Reallocate PCI BARs for large address spaces.
  • pcie_aspm=off: Disable power management to avoid latency spikes.
  • amdgpu.gttsize=126976: GPU GTT size ~124GiB.
  • ttm.pages_limit=32505856: TTM pages limit matching GTT.
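The two memory parameters describe the same 124GiB in different units, which the following arithmetic checks. It assumes 4 KiB pages, as on x86_64.

```shell
# amdgpu.gttsize is in MiB; ttm.pages_limit counts pages (4 KiB pages assumed).
mib_to_gib()    { echo $(( $1 / 1024 )); }
pages_for_gib() { echo $(( $1 * 1024 * 1024 / 4 )); }

echo "gttsize 126976 MiB = $(mib_to_gib 126976) GiB"
echo "pages for 124 GiB  = $(pages_for_gib 124)"
```

If you size GTT differently, recompute both values so they stay in agreement.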

Apply: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot.
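After the reboot it is worth confirming that the parameters actually reached the running kernel. The helper below is a small sketch that lists any expected parameter missing from a given command line.

```shell
# Print each expected parameter that is missing from a kernel command line.
# $1: command line string; remaining arguments: expected parameters.
missing_params() {
    cmdline=$1; shift
    for p in "$@"; do
        case " $cmdline " in
            *" $p "*) ;;          # present as a whole word
            *) echo "$p" ;;       # absent
        esac
    done
}

# Usage after reboot (no output means all parameters are active):
#   missing_params "$(cat /proc/cmdline)" iommu=pt pci=realloc pcie_aspm=off \
#       amdgpu.gttsize=126976 ttm.pages_limit=32505856
```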

4.5 Firewall Rules

sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0
sudo firewall-cmd --reload

Toolbox Installation & Network Verification

5.1 Passwordless SSH

Configure SSH keys between nodes. Test from Node 1 with ssh 192.168.100.2 date, which should print the date without asking for a password.
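One way to set up and check the keys is sketched below. The user name is a placeholder, and the key type is a choice rather than a requirement of this guide. BatchMode=yes makes ssh fail instead of prompting, so the check only passes when key-based login really works.

```shell
# Succeed only if login to the peer works without any prompt.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" date >/dev/null
}

# One-time setup on each node ("user" is a placeholder):
#   ssh-keygen -t ed25519
#   ssh-copy-id user@192.168.100.2
#   check_passwordless_ssh user@192.168.100.2 && echo "passwordless SSH OK"
```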

5.2 Installation

Run ./refresh_toolbox.sh on both nodes. This pulls the kyuz0/vllm-therock-gfx1151 image, detects /dev/infiniband, and creates a toolbox with:

  • --device /dev/dri --device /dev/kfd (iGPU/ROCm)
  • --device /dev/infiniband --group-add rdma
  • --ulimit memlock=-1 (memory pinning for DMA)
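To confirm the toolbox was created with these options, you can check from inside it that the device nodes are visible and that the memlock limit is lifted. This is a minimal sketch, not part of refresh_toolbox.sh.

```shell
# Print any of the given device paths that are not present.
missing_devices() {
    for d in "$@"; do
        [ -e "$d" ] || echo "$d"
    done
}

# Usage inside the toolbox (no output from the first command means all are present):
#   missing_devices /dev/dri /dev/kfd /dev/infiniband
#   ulimit -l    # expected to print "unlimited" when memlock=-1 took effect
```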

5.3 Verify RDMA Connection

Inside toolbox on head node:

/opt/compare_eth_vs_rdma.sh

Expected results:

Path                 Latency      Bandwidth
------------------------------------------------
Ethernet (1G LAN)    0.074 ms     0.94 Gbps
Ethernet (RoCE NIC)  0.068 ms     55.70 Gbps
RDMA (RoCE)          5.23 us      50.64 Gbps
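The table mixes milliseconds and microseconds, which hides the size of the gap. The sketch below normalizes the two RoCE NIC rows to microseconds and computes the ratio, using the values from the table.

```shell
# Convert a latency to microseconds. $1: value, $2: unit ("ms" or "us").
to_us() {
    awk -v v="$1" -v u="$2" 'BEGIN { printf "%.2f\n", (u == "ms") ? v * 1000 : v }'
}

# Ratio of two latencies (both in microseconds), to one decimal place.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

tcp_us=$(to_us 0.068 ms)
rdma_us=$(to_us 5.23 us)
echo "TCP on RoCE NIC: ${tcp_us} us, RDMA: ${rdma_us} us, ratio: $(speedup "$tcp_us" "$rdma_us")x"
```

On the same NIC, RDMA latency is about 13 times lower than TCP, while bandwidth is in the same range.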

Running the Cluster

6.1 Setup & Verify

Enter toolbox, run start-vllm-cluster.

  • Option 1: Configure IPs (Head: 192.168.100.1, Worker: 192.168.100.2).
  • Option 2: Start Ray Cluster. Select "Head" on Node 1, "Worker" on Node 2.
  • Option 3: Check status (expect 2 nodes, 2.0 GPU).
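For orientation, the menu options presumably wrap the standard Ray CLI. The guide does not show what start-vllm-cluster runs internally, so the sketch below is an assumption about the equivalent manual commands. Port 6379 is Ray's default and also assumed here.

```shell
# Print the Ray CLI command for a given role.
# $1: role (head|worker), $2: head IP, $3: this node's IP.
# Port 6379 (Ray's default) is an assumption.
ray_command() {
    role=$1; head_ip=$2; node_ip=$3
    case "$role" in
        head)   echo "ray start --head --node-ip-address=$node_ip --port=6379" ;;
        worker) echo "ray start --address=$head_ip:6379 --node-ip-address=$node_ip" ;;
        *)      echo "unknown role: $role" >&2; return 1 ;;
    esac
}

ray_command head   192.168.100.1 192.168.100.1
ray_command worker 192.168.100.1 192.168.100.2
# Afterwards, `ray status` on either node should list both nodes.
```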

6.2 Launching vLLM

  • Select "4. Launch VLLM Serve".
  • Choose model (e.g., Meta-Llama-3.1-8B-Instruct).
  • Set Tensor Parallelism = 2.
  • Enable "Force Eager Mode" (CUDA Graphs can deadlock distributed APU clusters).
  • Launch.
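The menu choices above correspond roughly to a vllm serve invocation like the one sketched here. The exact command the script issues is not shown in this guide, so treat this as an assumption. The flags themselves are standard vLLM options, with --enforce-eager being the CLI counterpart of "Force Eager Mode", and the model ID is illustrative.

```shell
# Build a `vllm serve` command line. $1: model ID, $2: tensor parallel size.
vllm_command() {
    echo "vllm serve $1 --tensor-parallel-size $2 --distributed-executor-backend ray --enforce-eager"
}

vllm_command meta-llama/Meta-Llama-3.1-8B-Instruct 2
```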

Gotchas:

  • First run: each node downloads weights independently.
  • Gated models: export HF_TOKEN before running script.
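A small guard can catch a missing token before a long launch fails partway through. This is a sketch, and the token value in the usage comment is a placeholder. Since each node downloads weights independently, the token presumably needs to be set on both nodes.

```shell
# Fail early if HF_TOKEN is unset or empty.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models will fail to download" >&2
        return 1
    fi
}

# Usage (token value is a placeholder):
#   export HF_TOKEN=hf_xxx
#   require_hf_token && start-vllm-cluster
```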

Troubleshooting

  • vLLM hangs: Enable "Force Eager Mode".
  • Link issues: Update Intel E810 firmware.

Next Steps

Try the cluster with a model like Llama 3.1 8B at TP=2 and benchmark tokens per second against a single node. For production, consider adding more nodes. A faster link such as 200GbE would only help once the PCIe x4 limit of roughly 50Gbps is lifted.
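A simple way to get a tokens-per-second figure is to time one completion request against vLLM's OpenAI-compatible API and divide the generated token count by the elapsed time. The sketch below assumes the server listens on vLLM's default port 8000 on the head node, that curl and jq are installed, and that GNU date is available. The model name and prompt are placeholders.

```shell
# Tokens per second from a token count and elapsed seconds.
tokens_per_second() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

# Usage against a running server (assumptions: port 8000, curl, jq, GNU date):
#   start=$(date +%s.%N)
#   tokens=$(curl -s http://localhost:8000/v1/completions \
#       -H 'Content-Type: application/json' \
#       -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 256}' \
#       | jq '.usage.completion_tokens')
#   end=$(date +%s.%N)
#   tokens_per_second "$tokens" "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
```

A single request measures latency-bound decoding. Repeating the measurement on a single node with TP=1 gives the baseline for comparison.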

References