Add IB merge auto view selection #2217
Draft
CyberSecurityErial wants to merge 1 commit into
Problem: NCCL_IB_MERGE_NICS has a fixed merged or unmerged topology view. On two-node multi-HCA systems, the default merged view can search fewer ring channels than the unmerged view and leave available rails unused.

Solution: Add merge-view topology construction and an opt-in NCCL_IB_MERGE_NICS=2 mode. MergeAuto builds unmerged and default merged topology/channel candidates, gathers candidate channel counts across ranks, selects the view with the larger global minimum channel count, and rebuilds the official topology with that selected view before normal graph search.

Limitations: The first version is limited to two-node IB/RoCE communicators and uses only globalMinChannels for selection. More detailed rail-coverage or bandwidth scoring is left for future tuning.

Verification: Two-node cluster validation passed in the qwqccl branch before rebasing onto upstream master. This upstream branch passes git diff --check. Local make reached C++ compilation and stopped because cuda_runtime.h is not available on this machine.

Signed-off-by: EchO <2710555967@qq.com>
Description
Draft implementation for #2216.
This PR adds an opt-in auto-selection mode for NCCL_IB_MERGE_NICS on two-node
IB/RoCE systems. The default behavior remains unchanged:
When auto mode is enabled, NCCL builds two topology/channel candidates before
normal transport setup:
UNMERGED
MERGED_DEFAULT

It gathers each candidate's searched ring channel count across ranks, selects
the view with the larger global minimum channel count, then rebuilds the
official topology using only the selected view before the normal graph search.
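
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: the names `MergeView`, `CandidateCounts`, and `selectMergeView` are hypothetical, the cross-rank gather is replaced by a plain vector that already holds every rank's counts, and the tie-break toward the default merged view is an assumption the description does not state.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical names for illustration only; not taken from the NCCL sources.
enum class MergeView { Unmerged, MergedDefault };

// Ring channel counts one rank found when searching under each candidate view.
struct CandidateCounts {
  int unmerged;
  int mergedDefault;
};

// Given the counts gathered from every rank, return the view whose global
// minimum channel count is larger. On a tie the default merged view is kept
// (an assumption; the PR description does not specify a tie-break).
MergeView selectMergeView(const std::vector<CandidateCounts>& perRank) {
  assert(!perRank.empty());
  int minUnmerged = perRank.front().unmerged;
  int minMerged = perRank.front().mergedDefault;
  for (const CandidateCounts& c : perRank) {
    minUnmerged = std::min(minUnmerged, c.unmerged);
    minMerged = std::min(minMerged, c.mergedDefault);
  }
  return (minUnmerged > minMerged) ? MergeView::Unmerged
                                   : MergeView::MergedDefault;
}
```

Taking the minimum across ranks means a view is only as good as its worst rank: if one rank finds a single channel under the unmerged view, that view scores 1 no matter how many channels the other ranks found.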
This is intentionally a simple first version. It does not run a benchmark, does
not establish duplicate transport connections, and does not do per-channel or
per-edge mixed merge selection.
Related Issues
Fixes #2216
Changes & Impact
reuse existing vNICs instead of appending duplicates.
views.
globalMinChannels.
search.
Current first-version scope:
Out of scope for this PR:
Performance Impact
Default behavior is unchanged unless NCCL_IB_MERGE_NICS=2 is explicitly set.

For NCCL_IB_MERGE_NICS=2, communicator initialization performs two extra
candidate topology/channel searches before selecting the official view. The data
path uses the selected normal topology only.
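
As a rough illustration of the opt-in behavior, the sketch below maps the three values of NCCL_IB_MERGE_NICS to a mode. It is not NCCL's actual parameter handling: `MergeMode`, `parseMergeMode`, and `mergeModeFromEnv` are hypothetical names, and treating an unset or unrecognized value as the merged view is an assumption based on the description's reference to the default merged view.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical enum for illustration; values mirror the NCCL_IB_MERGE_NICS settings.
enum class MergeMode { Unmerged = 0, Merged = 1, Auto = 2 };

// Map the raw environment value to a mode. An unset or unrecognized value
// falls back to Merged (assumed default), so auto selection only runs when
// the value is exactly "2".
MergeMode parseMergeMode(const char* value) {
  if (value == nullptr) return MergeMode::Merged;
  const std::string s(value);
  if (s == "0") return MergeMode::Unmerged;
  if (s == "2") return MergeMode::Auto;
  return MergeMode::Merged;
}

// Convenience wrapper that reads the variable from the process environment.
MergeMode mergeModeFromEnv() {
  return parseMergeMode(std::getenv("NCCL_IB_MERGE_NICS"));
}
```

Only `MergeMode::Auto` would trigger the extra candidate searches at communicator initialization; the other two modes keep the existing fixed view.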
Validation performed:
NCCL_IB_MERGE_NICS=0, manual NCCL_IB_MERGE_NICS=1, and auto
NCCL_IB_MERGE_NICS=2.
setting.
master and retested successfully in the same environment.