[ROCm] Add AMD GPU support to the simulation engine#710
Open
jeffdaily wants to merge 2 commits into
Open
Conversation
alien's simulation engine was CUDA-only and its device-selection gate rejected any GPU below compute capability 6.0, so it did not run on AMD hardware at all. This adds a HIP build (USE_HIP, default OFF) alongside the existing CUDA build: only the .cu translation units see the HIP toolchain, host C++ is untouched, and the NVIDIA build is byte-for-byte unchanged. Review order: start with the top-level CMakeLists.txt (the USE_HIP option, language gating, the .cu->HIP retag, and the relocatable-device-code wiring), then source/EngineKernels/cuda_to_hip.h (the single compat header, force-included on every HIP and host TU) and source/hip_compat/ (forwarding shims for the toolkit angle-bracket includes, on the HIP include path only). The remaining files are the handful of real semantic fixes and the Windows build support. Relocatable device code is load-bearing: __constant__ cudaSimulationParameters (ConstantMemory.cu) is read by device code in sibling EngineKernels translation units, so the CUDA path already uses CUDA_SEPARABLE_COMPILATION. On HIP, CMake 3.31 neither injects -fgpu-rdc from the property nor emits a separate device-link step, so -fgpu-rdc is added to the .cu compiles and -fgpu-rdc --hip-link to the executables that link the engine libraries; without both the device link fails with undefined cudaSimulationParameters / __hip_fatbin_* symbols. Semantic fixes (each correct on wave32 and wave64, NVIDIA path unchanged): - USE_HIP-guard the compute-capability gate so AMD GPUs are accepted. - Replace cg::reduce in the SPH fluid kernels with a tile shfl_xor butterfly in the cooperative_groups/reduce.h shim (HIP cooperative groups has no cg::reduce/cg::plus). tiled_partition<32> slices the wavefront into fixed 32-lane tiles and every tile shuffle is tile-relative, so the code is identical on wave32 and wave64. The two fluid kernels launch at blockDim 25 and 81, so their last tile is partial; the butterfly substitutes the reduction identity for any partner read from a non-resident lane (rank ^ offset >= the tile's active-lane count), so lane 0 -- the only lane the call sites consume -- yields exactly the resident-lane sum without relying on the value a shuffle returns for an inactive lane. - Guard out the float2/float3/int2 operators HIP's vector types already define component-wise (same semantics) to avoid an ambiguous overload; keep the mixed float2-int2 helper HIP lacks. - Cast max(int, uint32_t) operands to a common type (HIP's mixed overload returns double, breaking a uint64_t modulo). - Give a muscle-activation lambda an explicit float return type. Windows support (gated on the compiler frontend variant, the Linux/GCC build unaffected): - clang-cl (MSVC frontend): /FI force-include, /clang:-ffast-math, NOMINMAX on the HIP TUs. - Work around a clang-offload-bundler input==output failure (override CMAKE_HIP_COMPILE_OBJECT to emit -o instead of /Fo) and drive the -fgpu-rdc device link through a clang-driver wrapper (cmake/hip_link_win.py) for both the MSVC-frontend and GCC-frontend clang++ link paths, since lld-link cannot perform the HIP device link. Building on AMD GPUs (also documented in the README's build instructions): ``` mkdir build && cd build cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a cmake --build . --config Release -j8 ``` Test Plan: Built and validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), Radeon PRO V710 (gfx1101, RDNA3 wave32, Windows), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows). - EngineTests on gfx90a: 2978 tests, 2973 passed, 3 skipped, 2 failed -- the same tally on both the debug (direct-launch) and CUDA-graph paths, including the three 10000-20000-timestep long-runners. CLI smoke (graph path, headless): 1000 timesteps, fault-free, total energy conserved. - The two failures are not port defects. CommunicatorTests.sender_signalPriority_lowerNumTimesSentWins exercises an upstream last-writer-wins race in CommunicatorProcessor::tryTransmitSignal (no priority compare; the winner depends on block execution order, which differs across vendors). DataTransferTests.multipleCells_genome_multipleGenes_multipleNodes is a pure set/get round-trip compared with exact float ==, where one cell-position path rounds ~1 ULP (2^-17 at coordinate ~100) off CUDA -- physically identical to 8 significant figures. - GeometryTests is the GL-interop render path and needs an on-screen GL context, separate from the headless compute path these tests cover. Authored with the assistance of Claude (Anthropic).
The documented HIP build configured find_package(hip) but did not point CMake at the ROCm install. When CMake is driven through the vcpkg toolchain file, vcpkg does not add /opt/rocm to the prefix path, so on a clean ROCm container configure aborts with hip_DIR-NOTFOUND. Setting CMAKE_HIP_COMPILER to the absolute clang++ path and exporting ROCM_PATH do not fix this; only -DCMAKE_PREFIX_PATH=/opt/rocm (or putting /opt/rocm/bin on PATH) lets find_package(hip) resolve. This is a documentation-only change to the README build block; no code or CMake logic is touched. Authored with the assistance of an AI coding agent.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
alien's simulation engine was CUDA-only and its device-selection gate rejected any GPU below compute capability 6.0, so it did not run on AMD hardware at all. This adds a HIP build (USE_HIP, default OFF) alongside the existing CUDA build: only the .cu translation units see the HIP toolchain, host C++ is untouched, and the NVIDIA build is byte-for-byte unchanged.
Review order: start with the top-level CMakeLists.txt (the USE_HIP option, language gating, the .cu->HIP retag, and the relocatable-device-code wiring), then source/EngineKernels/cuda_to_hip.h (the single compat header, force-included on every HIP and host TU) and source/hip_compat/ (forwarding shims for the toolkit angle-bracket includes, on the HIP include path only). The remaining files are the handful of real semantic fixes and the Windows build support.
Relocatable device code is load-bearing:
__constant__cudaSimulationParameters (ConstantMemory.cu) is read by device code in sibling EngineKernels translation units, so the CUDA path already uses CUDA_SEPARABLE_COMPILATION. On HIP, CMake 3.31 neither injects -fgpu-rdc from the property nor emits a separate device-link step, so -fgpu-rdc is added to the .cu compiles and -fgpu-rdc --hip-link to the executables that link the engine libraries; without both the device link fails with undefined cudaSimulationParameters / _hip_fatbin* symbols.Semantic fixes (each correct on wave32 and wave64, NVIDIA path unchanged):
Windows support (gated on the compiler frontend variant, the Linux/GCC build unaffected):
Building on AMD GPUs (also documented in the README's build instructions):
Test Plan:
Built and validated on AMD Instinct MI250X (gfx90a, CDNA2 wave64), Radeon Pro W7800 (gfx1100, RDNA3 wave32), Radeon PRO V710 (gfx1101, RDNA3 wave32, Windows), and Radeon RX 9070 XT (gfx1201, RDNA4 wave32, Windows).
Authored with the assistance of Claude (Anthropic).