Add IB merge auto view selection #2217

Draft
CyberSecurityErial wants to merge 1 commit into NVIDIA:master from CyberSecurityErial:topic/ib-merge-auto-upstream

Description

Draft implementation for #2216.

This PR adds an opt-in auto-selection mode for NCCL_IB_MERGE_NICS on two-node
IB/RoCE systems. The default behavior remains unchanged:

NCCL_IB_MERGE_NICS=0  # force unmerged
NCCL_IB_MERGE_NICS=1  # current default merged behavior
NCCL_IB_MERGE_NICS=2  # new opt-in auto mode
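A minimal C sketch of how the three NCCL_IB_MERGE_NICS values could map onto modes. The enum and function names are hypothetical, and treating an unset, empty, or unrecognized value as the default merged mode is an assumption of this sketch, not something the PR states.

```c
#include <stdlib.h>

/* Hypothetical names for illustration; the PR's identifiers may differ. */
typedef enum {
  IB_MERGE_UNMERGED = 0, /* NCCL_IB_MERGE_NICS=0: force unmerged */
  IB_MERGE_DEFAULT = 1,  /* NCCL_IB_MERGE_NICS=1: current default merged behavior */
  IB_MERGE_AUTO = 2      /* NCCL_IB_MERGE_NICS=2: new opt-in auto mode */
} ibMergeMode;

/* Map the raw NCCL_IB_MERGE_NICS string to a mode.
 * Assumption of this sketch: unset, empty, or unrecognized values keep
 * the default merged behavior. */
static ibMergeMode parseIbMergeMode(const char* value) {
  if (value == NULL || *value == '\0') return IB_MERGE_DEFAULT;
  char* end = NULL;
  long v = strtol(value, &end, 10);
  if (*end != '\0') return IB_MERGE_DEFAULT; /* not a plain integer */
  if (v == 0) return IB_MERGE_UNMERGED;
  if (v == 2) return IB_MERGE_AUTO;
  return IB_MERGE_DEFAULT;
}
```

A caller would pass getenv("NCCL_IB_MERGE_NICS") and branch on the result; only IB_MERGE_AUTO triggers candidate construction.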

When auto mode is enabled, NCCL builds two topology/channel candidates before
normal transport setup:

  • UNMERGED
  • MERGED_DEFAULT

It gathers each candidate's searched ring channel count from all ranks, selects
the view with the larger global minimum channel count (the smallest count any
rank found for that view), and then rebuilds the official topology using only
the selected view before the normal graph search.
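The selection rule can be sketched in C as below. This assumes every rank already holds the all-gathered table of per-rank channel counts for both candidates; the array layout, the names, and the tie-break toward MERGED_DEFAULT are illustrative assumptions rather than the PR's code.

```c
#include <assert.h>

/* Hypothetical names; candidate order follows the PR description. */
enum { VIEW_UNMERGED = 0, VIEW_MERGED_DEFAULT = 1, NUM_VIEWS = 2 };

/* counts[r * NUM_VIEWS + v] holds the ring channel count rank r searched
 * for view v, i.e. the table every rank holds after the all-gather. */
static int globalMinChannels(const int* counts, int nRanks, int view) {
  assert(nRanks > 0);
  int m = counts[view]; /* rank 0 */
  for (int r = 1; r < nRanks; r++) {
    int c = counts[r * NUM_VIEWS + view];
    if (c < m) m = c;
  }
  return m;
}

/* Pick the view with the larger global minimum channel count.
 * Assumption: on a tie, keep MERGED_DEFAULT so behavior matches mode 1. */
static int selectMergeView(const int* counts, int nRanks) {
  int unmerged = globalMinChannels(counts, nRanks, VIEW_UNMERGED);
  int merged = globalMinChannels(counts, nRanks, VIEW_MERGED_DEFAULT);
  return unmerged > merged ? VIEW_UNMERGED : VIEW_MERGED_DEFAULT;
}
```

Because every rank evaluates the same gathered table, all ranks reach the same decision without a further exchange. For example, with made-up numbers where one rank finds 4 unmerged and 2 merged channels and another finds 1 and 2, the global minimums are 1 and 2, so the merged view is selected.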

This is intentionally a simple first version. It does not run a benchmark, does
not establish duplicate transport connections, and does not do per-channel or
per-edge mixed merge selection.

Related Issues

Fixes #2216

Changes & Impact

  • Add a NET merge-view topology entry point.
  • Make IB virtual NIC creation idempotent so repeated candidate construction can
    reuse existing vNICs instead of appending duplicates.
  • Add a MergeAuto candidate builder for unmerged and default merged topology
    views.
  • Select the official view using only globalMinChannels.
  • Rebuild the official topology with the selected view before normal graph
    search.
  • Keep unsupported cases on the existing default path.
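One way to make vNIC creation idempotent is a find-or-append lookup keyed on the set of physical devices, sketched below. The structs, the fixed capacities, and the order-sensitive device comparison are hypothetical; the PR's actual vNIC representation may differ.

```c
#include <assert.h>
#include <string.h>

#define MAX_VNICS 32
#define MAX_DEVS_PER_VNIC 4

/* Hypothetical vNIC record: an ordered list of physical device indices. */
typedef struct {
  int nDevs;
  int devs[MAX_DEVS_PER_VNIC];
} vNic;

typedef struct {
  int n;
  vNic nics[MAX_VNICS];
} vNicTable;

/* True if vNIC a is built from exactly these devices, in this order. */
static int sameVNic(const vNic* a, const int* devs, int nDevs) {
  return a->nDevs == nDevs && memcmp(a->devs, devs, sizeof(int) * nDevs) == 0;
}

/* Return the index of an existing vNIC with the same devices, or append a
 * new one. Repeating a call with the same input returns the same index and
 * leaves t->n unchanged, so building a second candidate view does not add
 * duplicate vNICs. Returns -1 if the table is full. */
static int getOrCreateVNic(vNicTable* t, const int* devs, int nDevs) {
  assert(nDevs > 0 && nDevs <= MAX_DEVS_PER_VNIC);
  for (int i = 0; i < t->n; i++)
    if (sameVNic(&t->nics[i], devs, nDevs)) return i;
  if (t->n == MAX_VNICS) return -1;
  vNic* v = &t->nics[t->n];
  v->nDevs = nDevs;
  memcpy(v->devs, devs, sizeof(int) * nDevs);
  return t->n++;
}
```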

Current first-version scope:

  • two-node communicators only
  • IB/RoCE NET transport only
  • init-time selection only
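A sketch of the first-version gating, assuming init has already counted nodes and knows whether the NET transport is the IB/RoCE plugin. The struct and field names are hypothetical, and mapping an ineligible NCCL_IB_MERGE_NICS=2 request to merged mode 1 is an assumption about what "the existing default path" means.

```c
#include <stdbool.h>

/* Hypothetical summary of what communicator init knows at this point. */
typedef struct {
  int mergeMode; /* parsed NCCL_IB_MERGE_NICS value: 0, 1, or 2 */
  int nNodes;    /* number of distinct nodes in the communicator */
  bool netIsIb;  /* NET transport is the IB/RoCE plugin */
} mergeAutoCtx;

/* First-version scope: MergeAuto runs only for two-node IB/RoCE
 * communicators, once at init. */
static bool mergeAutoEligible(const mergeAutoCtx* c) {
  return c->mergeMode == 2 && c->nNodes == 2 && c->netIsIb;
}

/* Mode the rest of init should use. Manual modes 0 and 1 pass through
 * unchanged; an ineligible auto request falls back to merged mode 1
 * (assumption: this is "the existing default path"). */
static int effectiveMergeMode(const mergeAutoCtx* c) {
  if (c->mergeMode != 2) return c->mergeMode;
  return mergeAutoEligible(c) ? 2 : 1;
}
```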

Out of scope for this PR:

  • multi-node auto selection
  • runtime benchmark feedback
  • rail/bandwidth scoring
  • per-collective switching
  • channel-level or edge-level mixed merge

Performance Impact

Default behavior is unchanged unless NCCL_IB_MERGE_NICS=2 is explicitly set.

For NCCL_IB_MERGE_NICS=2, communicator initialization performs two extra
candidate topology/channel searches before selecting the official view. The data
path uses only the topology built from the selected view; no transport
connections are set up for the rejected candidate.

Validation performed:

  • Built and tested in a two-node multi-HCA A800/RoCE environment.
  • Compared manual NCCL_IB_MERGE_NICS=0, manual NCCL_IB_MERGE_NICS=1, and
    auto NCCL_IB_MERGE_NICS=2.
  • In the tested case, auto selected the unmerged view, matching the better manual
    setting.
  • Rebased onto current upstream master and retested successfully in the same
    environment.

Problem: NCCL_IB_MERGE_NICS selects a fixed merged or unmerged topology view. On two-node multi-HCA systems, the default merged view can search fewer ring channels than the unmerged view and leave available rails unused.

Solution: Add merge-view topology construction and an opt-in NCCL_IB_MERGE_NICS=2 mode. MergeAuto builds unmerged and default merged topology/channel candidates, gathers candidate channel counts across ranks, selects the view with the larger global minimum channel count, and rebuilds the official topology with that selected view before normal graph search.

Limitations: The first version is limited to two-node IB/RoCE communicators and uses only globalMinChannels for selection. More detailed rail-coverage or bandwidth scoring is left for future tuning.

Verification: Two-node cluster validation passed in the qwqccl branch before rebasing onto upstream master. This upstream branch passes git diff --check. Local make reached C++ compilation and stopped because cuda_runtime.h is not available on this machine.

Signed-off-by: EchO <2710555967@qq.com>

Development

Successfully merging this pull request may close these issues.

[RFE]: Add opt-in auto-selection for IB NIC merge mode on two-node multi-HCA systems