AMD Strix Halo RDMA Cluster: Setup Guide for Distributed vLLM Inference

This guide walks through building a two-node AMD Strix Halo cluster connected via Intel E810 NICs (RoCE v2) for distributed vLLM inference. Tensor Parallelism splits each layer's weights across the nodes, so the nodes must synchronize after every layer and inter-node latency needs to stay below 10µs. RoCE v2 delivers roughly 5µs.

TL;DR (Quick Start)

On both nodes:

Install or update to Fedora 43 and install the E810 NICs. Check firmware with ethtool -i <interface>.
BIOS: Set iGPU to 512MB. Kernel params: iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856.
Configure passwordless SSH.
Assign static IPs (192.168.100.1 and .2), set MTU 9000, and add the interface to the trusted firewall zone.
Run ./refresh_toolbox.sh (installs container with RDMA support and custom librccl.so patch).
Run start-vllm-cluster, select "2. Start Ray Cluster", then "4. Launch VLLM Serve". Export HF_TOKEN for gated models.

Concepts & Architecture

vLLM: High-performance inference engine. Tensor Parallelism (TP) shards each layer's weights across GPUs.
Ray: Orchestrates cluster, manages workers.
RCCL: AMD's collective communication library. Handles data plane: fast tensor sync between GPUs. TP=2 exchanges partial results after every layer, thousands of times per second.
RoCE v2: RDMA over Converged Ethernet. Writes data directly from one node's memory to another, bypassing CPU/OS kernel.
- Without RDMA: ~70-100µs latency (TCP/IP).
- With RDMA: ~5µs latency.
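
To make the latency gap concrete, the sketch below estimates the per-token synchronization cost from latency alone. It assumes two all-reduces per transformer layer (a common tensor-parallel layout, not something this guide measures) and the 32 layers of Llama 3.1 8B, and it ignores bandwidth and compute time. The function name is illustrative.

```shell
# Latency-only cost of tensor-parallel synchronization for one generated token.
# Args: <layers> <all-reduces per layer> <latency per all-reduce in microseconds>
tp_sync_overhead_us() {
    echo $(( $1 * $2 * $3 ))
}

# Llama 3.1 8B has 32 layers; 2 all-reduces per layer is an assumption.
echo "RDMA   (~5 us):  $(tp_sync_overhead_us 32 2 5) us per token"
echo "TCP/IP (~70 us): $(tp_sync_overhead_us 32 2 70) us per token"
```

Under these assumptions RDMA adds about 0.3 ms per token, while TCP/IP adds about 4.5 ms.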

Hardware Prerequisites

Nodes: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo", 128GB Unified Memory.
Network Cards: Intel Ethernet Controller E810-CQDA1 (100GbE QSFP28).
Connection: Direct Attach Copper (DAC) cable (e.g., QSFPTEK 100G QSFP28 DAC). No switch needed for 2 nodes.
PCIe Note: The Framework motherboard's PCIe slot is physically x4, so the x16 NIC needs a riser (e.g., CY PCI-E Express 4x to 16x Extender). The riser does not degrade performance: the x4 link still delivers ~50Gbps and ~5µs latency.

Host Configuration (Fedora)

Tested on Fedora 43 with kernels 6.18.5 and 6.18.6.

4.1 Install Packages

sudo dnf install rdma-core libibverbs-utils perftest

rdma-core: Userspace RDMA components.
libibverbs-utils: Query RDMA devices (ibv_devinfo).
perftest: Benchmarks (ib_write_bw, ib_send_lat).
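
A quick way to confirm the install is to check that the tools named above are on the PATH. This is a minimal sketch; the helper name missing_tools is hypothetical.

```shell
# Print each named tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Empty output means all three tools from the packages above are installed.
missing_tools ibv_devinfo ib_write_bw ib_send_lat
```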

4.2 Check Native Firmware

ethtool -i enp194s0np0

Example output:

firmware-version: 4.91 0x800214b5 1.3909.0

If your firmware is older, update it with the Intel NVM Update Tool.
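
The comparison can be scripted. The sketch below pulls the leading version field out of the ethtool output and compares it against a baseline. Treating 4.91 (the example output above) as the baseline is an assumption, and both helper names are hypothetical.

```shell
# Extract the leading firmware version (e.g. "4.91") from `ethtool -i` output on stdin.
fw_version() {
    awk '/^firmware-version:/ { print $2 }'
}

# Succeed if version $1 is at least version $2 (relies on GNU `sort -V`).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a node (interface name as used in this guide):
#   current=$(ethtool -i enp194s0np0 | fw_version)
#   version_at_least "$current" 4.91 || echo "firmware $current is older than 4.91"
```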

4.3 Network Configuration

Node 1 (Head - 192.168.100.1):

sudo ip link set enp194s0np0 up
sudo ip addr add 192.168.100.1/30 dev enp194s0np0
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"

Node 2 (Worker - 192.168.100.2):

sudo ip link set enp194s0np0 up
sudo ip addr add 192.168.100.2/30 dev enp194s0np0
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"

Verify link: rdma link should show state ACTIVE physical_state LINK_UP.
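
The link check can be automated, and it is worth confirming that jumbo frames actually pass end to end. The sketch below assumes IPv4, where a 9000-byte MTU leaves 8972 bytes of ICMP payload after the 20-byte IP header and 8-byte ICMP header. The helper names are hypothetical.

```shell
# Succeed if `rdma link` output on stdin shows an active, physically-up link.
rdma_link_up() {
    grep -q 'state ACTIVE physical_state LINK_UP'
}

# Largest ICMP payload that fits in one frame: MTU minus 20-byte IPv4 and 8-byte ICMP headers.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Usage from the head node:
#   rdma link | rdma_link_up && echo "RDMA link is up"
#   ping -M do -s "$(icmp_payload_for_mtu 9000)" -c 3 192.168.100.2
```

The -M do flag forbids fragmentation, so the ping succeeds only if both ends accept 9000-byte frames.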

4.4 BIOS & Kernel Configuration

BIOS: Set iGPU Memory Allocation to 512MB.

Append the kernel params to GRUB_CMDLINE_LINUX in /etc/default/grub:

iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856

iommu=pt: IOMMU pass-through for NIC and iGPU.
pci=realloc: Reallocate PCI BARs for large address spaces.
pcie_aspm=off: Disable power management to avoid latency spikes.
amdgpu.gttsize=126976: GPU GTT size in MiB (124GiB).
ttm.pages_limit=32505856: TTM page limit in 4KiB pages (also 124GiB), matching the GTT size.

Apply: sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot.
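
The two memory values follow from the 124GiB target: amdgpu.gttsize is in MiB and ttm.pages_limit counts 4KiB pages. The sketch below reproduces the arithmetic and adds a check that the parameters survived the reboot. The helper names are hypothetical.

```shell
# amdgpu.gttsize is given in MiB: GiB * 1024.
gtt_mib() { echo $(( $1 * 1024 )); }

# ttm.pages_limit counts 4 KiB pages: GiB * 1024 * 1024 KiB / 4 KiB.
ttm_pages() { echo $(( $1 * 1024 * 1024 / 4 )); }

echo "amdgpu.gttsize=$(gtt_mib 124) ttm.pages_limit=$(ttm_pages 124)"

# Print every expected parameter that is absent from a kernel command line on stdin.
missing_kernel_params() {
    cmdline=$(cat)
    for p in iommu=pt pci=realloc pcie_aspm=off \
             amdgpu.gttsize=126976 ttm.pages_limit=32505856; do
        case " $cmdline " in
            *" $p "*) ;;
            *) echo "$p" ;;
        esac
    done
}

# Usage after reboot (empty output means all parameters are active):
#   missing_kernel_params < /proc/cmdline
```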

4.5 Firewall Rules

sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0
sudo firewall-cmd --reload

Toolbox Installation & Network Verification

5.1 Passwordless SSH

Configure SSH keys between nodes. Test with ssh <other-node> date.
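
One possible way to set this up and verify it is sketched below. It assumes the same username on both nodes and uses the RDMA link addresses from this guide. The function name is hypothetical.

```shell
# Succeed only if key-based login works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" date
}

# One-time setup, run manually on each node (asks for the peer's password once):
#   ssh-keygen -t ed25519
#   ssh-copy-id 192.168.100.2    # from the head; use 192.168.100.1 from the worker
#
# Then verify:
#   check_passwordless_ssh 192.168.100.2 && echo "passwordless SSH OK"
```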

5.2 Installation

Run ./refresh_toolbox.sh on both nodes. This pulls the kyuz0/vllm-therock-gfx1151 image, detects /dev/infiniband, and creates a toolbox with:

--device /dev/dri --device /dev/kfd (iGPU/ROCm)
--device /dev/infiniband --group-add rdma
--ulimit memlock=-1 (memory pinning for DMA)
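
To confirm the flags above took effect, a few checks can be run from inside the toolbox. This is a sketch only: the helper names are hypothetical, and how group membership appears inside the container may differ on your setup.

```shell
# Succeed if the given `ulimit -l` value means locked memory is unlimited.
memlock_unlimited() {
    [ "$1" = "unlimited" ]
}

# Succeed if the given space-separated group list contains "rdma".
in_rdma_group() {
    case " $1 " in
        *" rdma "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside the toolbox:
#   memlock_unlimited "$(ulimit -l)" || echo "memlock is limited; DMA pinning may fail"
#   in_rdma_group "$(id -nG)"        || echo "user is not in the rdma group"
#   ls /dev/infiniband /dev/kfd /dev/dri
```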

5.3 Verify RDMA Connection

Inside toolbox on head node:

/opt/compare_eth_vs_rdma.sh

Expected results:

Path                 Latency      Bandwidth
------------------------------------------------
Ethernet (1G LAN)    0.074 ms     0.94 Gbps
Ethernet (RoCE NIC)  0.068 ms     55.70 Gbps
RDMA (RoCE)          5.23 us      50.64 Gbps
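
The table mixes milliseconds and microseconds, which hides the size of the gap. The sketch below converts the expected values to microseconds and computes the ratios. The helper name is hypothetical, and your own measurements will differ.

```shell
# Ratio of two latencies given in microseconds, to one decimal place.
latency_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# 0.068 ms = 68 us (TCP over the RoCE NIC); 0.074 ms = 74 us (1G LAN).
echo "TCP over E810 vs RDMA: $(latency_ratio 68 5.23)x"
echo "1G LAN vs RDMA:        $(latency_ratio 74 5.23)x"
```

With the expected numbers, RDMA is about 13x lower latency than TCP on the same NIC, even though the bandwidth of the two paths is similar.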

Running the Cluster

6.1 Setup & Verify

Enter toolbox, run start-vllm-cluster.

Option 1: Configure IPs (Head: 192.168.100.1, Worker: 192.168.100.2).
Option 2: Start Ray Cluster. Select "Head" on Node 1, "Worker" on Node 2.
Option 3: Check status (expect 2 nodes, 2.0 GPU).
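
The status check can also be scripted with the ray CLI directly. The sketch below assumes that ray status prints a resource line of the form "0.0/2.0 GPU" and that the menu option reports the same information; neither is confirmed by this guide. The helper name is hypothetical.

```shell
# Print the total GPU count from `ray status` output on stdin.
# Assumes a usage line of the form " 0.0/2.0 GPU".
ray_total_gpus() {
    awk '$2 == "GPU" { split($1, parts, "/"); print parts[2] }'
}

# Usage inside the toolbox on the head node:
#   [ "$(ray status | ray_total_gpus)" = "2.0" ] && echo "both GPUs registered"
```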

6.2 Launching vLLM

Select "4. Launch VLLM Serve".
Choose model (e.g., Meta-Llama-3.1-8B-Instruct).
Set Tensor Parallelism = 2.
Enable "Force Eager Mode" (CUDA Graphs can deadlock distributed APU clusters).
Launch.
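
For reference, the menu choices above correspond roughly to a manual vllm serve invocation like the one sketched here. What the script actually runs is not shown in this guide, so treat the wrapper function, the model ID, and the flag set as assumptions.

```shell
# Launch vLLM with TP=2 across the Ray cluster, with graph capture disabled.
# Arg 1 (optional): Hugging Face model ID.
launch_vllm_tp2() {
    vllm serve "${1:-meta-llama/Llama-3.1-8B-Instruct}" \
        --tensor-parallel-size 2 \
        --distributed-executor-backend ray \
        --enforce-eager
}

# Usage on the head node, after the Ray cluster is up:
#   launch_vllm_tp2
# Once the server is ready, list the served models (8000 is vLLM's default port):
#   curl http://localhost:8000/v1/models
```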

Gotchas:

First run: each node downloads weights independently.
Gated models: export HF_TOKEN before running script.
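
A small guard can catch a missing token before a long download fails. This is a sketch, assuming the script reads HF_TOKEN from the environment it is started in. The function name is hypothetical.

```shell
# Fail early with a message if HF_TOKEN is unset or empty.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models will fail to download" >&2
        return 1
    fi
}

# Usage on each node, since each downloads weights independently:
#   export HF_TOKEN=<your-token>
#   require_hf_token && start-vllm-cluster
```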

Troubleshooting

vLLM hangs: Enable "Force Eager Mode".
Link issues: Update Intel E810 firmware.
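
If throughput looks more like the TCP numbers than the RDMA numbers, RCCL's own logging can help. RCCL honors the NCCL_* environment variables; the sketch below turns on verbose initialization and network logging. The function name is hypothetical, and whether the variables reach the Ray workers depends on how the cluster script starts them.

```shell
# Enable verbose RCCL logging for initialization and network transport selection.
enable_rccl_debug() {
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=INIT,NET
}

# Usage: run before starting Ray and vLLM on each node, then check the logs
# to see whether an InfiniBand/RoCE transport or plain sockets was selected.
#   enable_rccl_debug
```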

Next Steps

Try the cluster with a model like Llama 3.1 8B at TP=2. Benchmark tokens per second vs single-node. For production, consider adding more nodes or upgrading to 200GbE.
