Skip to content

[ROCm] Add AMD GPU support to the simulation engine#710

Open
jeffdaily wants to merge 2 commits into
chrxh:developfrom
jeffdaily:moat-port
Open

[ROCm] Add AMD GPU support to the simulation engine#710
jeffdaily wants to merge 2 commits into
chrxh:developfrom
jeffdaily:moat-port

Conversation

@jeffdaily

@jeffdaily jeffdaily commented Jun 8, 2026

Copy link
Copy Markdown

alien's simulation engine was CUDA-only and its device-selection gate rejected any GPU below compute capability 6.0, so it did not run on AMD hardware at all. This adds a HIP build (USE_HIP, default OFF) alongside the existing CUDA build: only the .cu translation units see the HIP toolchain, host C++ is untouched, and the NVIDIA build is byte-for-byte unchanged.

Review order: start with the top-level CMakeLists.txt (the USE_HIP option, language gating, the .cu->HIP retag, and the relocatable-device-code wiring), then source/EngineKernels/cuda_to_hip.h (the single compat header, force-included on every HIP and host TU) and source/hip_compat/ (forwarding shims for the toolkit angle-bracket includes, on the HIP include path only). The remaining files are the handful of real semantic fixes and the Windows build support.

Relocatable device code is load-bearing: __constant__ cudaSimulationParameters (ConstantMemory.cu) is read by device code in sibling EngineKernels translation units, so the CUDA path already uses CUDA_SEPARABLE_COMPILATION. On HIP, CMake 3.31 neither injects -fgpu-rdc from the property nor emits a separate device-link step, so -fgpu-rdc is added to the .cu compiles and -fgpu-rdc --hip-link to the executables that link the engine libraries; without both the device link fails with undefined cudaSimulationParameters / _hip_fatbin* symbols.

Semantic fixes (each correct on wave32 and wave64, NVIDIA path unchanged):

  • USE_HIP-guard the compute-capability gate so AMD GPUs are accepted.
  • Replace cg::reduce in the SPH fluid kernels with a tile shfl_xor butterfly in the cooperative_groups/reduce.h shim (HIP cooperative groups has no cg::reduce/cg::plus). tiled_partition<32> slices the wavefront into fixed 32-lane tiles and every tile shuffle is tile-relative, so the code is identical on wave32 and wave64. The two fluid kernels launch at blockDim 25 and 81, so their last tile is partial; the butterfly substitutes the reduction identity for any partner read from a non-resident lane (rank ^ offset >= the tile's active-lane count), so lane 0 -- the only lane the call sites consume -- yields exactly the resident-lane sum without relying on the value a shuffle returns for an inactive lane.
  • Guard out the float2/float3/int2 operators HIP's vector types already define component-wise (same semantics) to avoid an ambiguous overload; keep the mixed float2-int2 helper HIP lacks.
  • Cast max(int, uint32_t) operands to a common type (HIP's mixed overload returns double, breaking a uint64_t modulo).
  • Give a muscle-activation lambda an explicit float return type.

Windows support (gated on the compiler frontend variant, the Linux/GCC build unaffected):

  • clang-cl (MSVC frontend): /FI force-include, /clang:-ffast-math, NOMINMAX on the HIP TUs.
  • Work around a clang-offload-bundler input==output failure (override CMAKE_HIP_COMPILE_OBJECT to emit -o instead of /Fo) and drive the -fgpu-rdc device link through a clang-driver wrapper (cmake/hip_link_win.py) for both the MSVC-frontend and GCC-frontend clang++ link paths, since lld-link cannot perform the HIP device link.

Building on AMD GPUs (also documented in the README's build instructions):

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
cmake --build . --config Release -j8

Test Plan:

Built and validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), Radeon PRO V710 (gfx1101, RDNA3 wave32, Windows), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).

  • EngineTests on gfx90a: 2978 tests, 2973 passed, 3 skipped, 2 failed -- the same tally on both the debug (direct-launch) and CUDA-graph paths, including the three 10000-20000-timestep long-runners. CLI smoke (graph path, headless): 1000 timesteps, fault-free, total energy conserved.
  • The two failures are not port defects. CommunicatorTests.sender_signalPriority_lowerNumTimesSentWins exercises an upstream last-writer-wins race in CommunicatorProcessor::tryTransmitSignal (no priority compare; the winner depends on block execution order, which differs across vendors). DataTransferTests.multipleCells_genome_multipleGenes_multipleNodes is a pure set/get round-trip compared with exact float ==, where one cell-position path rounds ~1 ULP (2^-17 at coordinate ~100) off CUDA -- physically identical to 8 significant figures.
  • GeometryTests is the GL-interop render path and needs an on-screen GL context, separate from the headless compute path these tests cover.

Authored with the assistance of Claude (Anthropic).

jeffdaily added 2 commits June 8, 2026 17:57
alien's simulation engine was CUDA-only and its device-selection gate rejected any GPU below compute capability 6.0, so it did not run on AMD hardware at all. This adds a HIP build (USE_HIP, default OFF) alongside the existing CUDA build: only the .cu translation units see the HIP toolchain, host C++ is untouched, and the NVIDIA build is byte-for-byte unchanged.

Review order: start with the top-level CMakeLists.txt (the USE_HIP option, language gating, the .cu->HIP retag, and the relocatable-device-code wiring), then source/EngineKernels/cuda_to_hip.h (the single compat header, force-included on every HIP and host TU) and source/hip_compat/ (forwarding shims for the toolkit angle-bracket includes, on the HIP include path only). The remaining files are the handful of real semantic fixes and the Windows build support.

Relocatable device code is load-bearing: __constant__ cudaSimulationParameters (ConstantMemory.cu) is read by device code in sibling EngineKernels translation units, so the CUDA path already uses CUDA_SEPARABLE_COMPILATION. On HIP, CMake 3.31 neither injects -fgpu-rdc from the property nor emits a separate device-link step, so -fgpu-rdc is added to the .cu compiles and -fgpu-rdc --hip-link to the executables that link the engine libraries; without both the device link fails with undefined cudaSimulationParameters / __hip_fatbin_* symbols.

Semantic fixes (each correct on wave32 and wave64, NVIDIA path unchanged):
- USE_HIP-guard the compute-capability gate so AMD GPUs are accepted.
- Replace cg::reduce in the SPH fluid kernels with a tile shfl_xor butterfly in the cooperative_groups/reduce.h shim (HIP cooperative groups has no cg::reduce/cg::plus). tiled_partition<32> slices the wavefront into fixed 32-lane tiles and every tile shuffle is tile-relative, so the code is identical on wave32 and wave64. The two fluid kernels launch at blockDim 25 and 81, so their last tile is partial; the butterfly substitutes the reduction identity for any partner read from a non-resident lane (rank ^ offset >= the tile's active-lane count), so lane 0 -- the only lane the call sites consume -- yields exactly the resident-lane sum without relying on the value a shuffle returns for an inactive lane.
- Guard out the float2/float3/int2 operators HIP's vector types already define component-wise (same semantics) to avoid an ambiguous overload; keep the mixed float2-int2 helper HIP lacks.
- Cast max(int, uint32_t) operands to a common type (HIP's mixed overload returns double, breaking a uint64_t modulo).
- Give a muscle-activation lambda an explicit float return type.

Windows support (gated on the compiler frontend variant, the Linux/GCC build unaffected):
- clang-cl (MSVC frontend): /FI force-include, /clang:-ffast-math, NOMINMAX on the HIP TUs.
- Work around a clang-offload-bundler input==output failure (override CMAKE_HIP_COMPILE_OBJECT to emit -o instead of /Fo) and drive the -fgpu-rdc device link through a clang-driver wrapper (cmake/hip_link_win.py) for both the MSVC-frontend and GCC-frontend clang++ link paths, since lld-link cannot perform the HIP device link.

Building on AMD GPUs (also documented in the README's build instructions):
```
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
cmake --build . --config Release -j8
```

Test Plan:

Built and validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), Radeon PRO V710 (gfx1101, RDNA3 wave32, Windows), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).

- EngineTests on gfx90a: 2978 tests, 2973 passed, 3 skipped, 2 failed -- the same tally on both the debug (direct-launch) and CUDA-graph paths, including the three 10000-20000-timestep long-runners. CLI smoke (graph path, headless): 1000 timesteps, fault-free, total energy conserved.
- The two failures are not port defects. CommunicatorTests.sender_signalPriority_lowerNumTimesSentWins exercises an upstream last-writer-wins race in CommunicatorProcessor::tryTransmitSignal (no priority compare; the winner depends on block execution order, which differs across vendors). DataTransferTests.multipleCells_genome_multipleGenes_multipleNodes is a pure set/get round-trip compared with exact float ==, where one cell-position path rounds ~1 ULP (2^-17 at coordinate ~100) off CUDA -- physically identical to 8 significant figures.
- GeometryTests is the GL-interop render path and needs an on-screen GL context, separate from the headless compute path these tests cover.

Authored with the assistance of Claude (Anthropic).
The documented HIP build configured find_package(hip) but did not point
CMake at the ROCm install. When CMake is driven through the vcpkg
toolchain file, vcpkg does not add /opt/rocm to the prefix path, so on a
clean ROCm container configure aborts with hip_DIR-NOTFOUND. Setting
CMAKE_HIP_COMPILER to the absolute clang++ path and exporting ROCM_PATH
do not fix this; only -DCMAKE_PREFIX_PATH=/opt/rocm (or putting
/opt/rocm/bin on PATH) lets find_package(hip) resolve.

This is a documentation-only change to the README build block; no code
or CMake logic is touched.

Authored with the assistance of an AI coding agent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant