Skip to content

[BUG] Fix CAGRA search recall with a graph built by NN Descent#819

Open
enp1s0 wants to merge 73 commits into
NVIDIA:mainfrom
enp1s0:cagra-fix-nnd-init
Open

[BUG] Fix CAGRA search recall with a graph built by NN Descent#819
enp1s0 wants to merge 73 commits into
NVIDIA:mainfrom
enp1s0:cagra-fix-nnd-init

Conversation

@enp1s0

@enp1s0 enp1s0 commented Apr 12, 2025

Copy link
Copy Markdown
Contributor

This PR addresses an unexpected low recall issue in the CAGRA search with a graph generated by NN Descent.
For that, it updates the initial NN Descent graph generation so that all indices are included, whereas some nodes are not found in the branch-25.10 implementation.

figure

(itopk=256)

This PR also makes the following changes:

  • Adds a workaround to remove duplicate node indices from the kNN result
  • Adds test cases for different graph degrees

@enp1s0 enp1s0 requested a review from a team as a code owner April 12, 2025 16:00
@enp1s0 enp1s0 self-assigned this Apr 12, 2025
@github-actions github-actions Bot added the cpp label Apr 12, 2025
@enp1s0 enp1s0 added bug Something isn't working non-breaking Introduces a non-breaking change cpp and removed cpp labels Apr 12, 2025
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing Apr 22, 2025

@tfeher tfeher left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @enp1s0 for this PR! Looks good overall, I have just one request to clarify the expectations for the random graph.

Comment thread cpp/src/neighbors/detail/nn_descent.cuh
@tfeher tfeher changed the base branch from branch-25.06 to branch-25.08 May 26, 2025 08:36
@enp1s0 enp1s0 force-pushed the cagra-fix-nnd-init branch from ad5bf28 to 8a52d05 Compare July 18, 2025 22:24
@enp1s0 enp1s0 requested review from a team as code owners September 6, 2025 08:52
@enp1s0 enp1s0 requested a review from bdice September 6, 2025 08:52
@copy-pr-bot

copy-pr-bot Bot commented Sep 6, 2025

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@enp1s0 enp1s0 changed the base branch from branch-25.08 to branch-25.10 September 6, 2025 08:53

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All CMake/CI/packaging changes have been removed, so approving.

@cjnolet cjnolet changed the base branch from branch-25.10 to main November 3, 2025 17:06

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json`:
- Around line 148-153: The generated JIT variant uses pq_len="8" which is
unsupported by add_compute_distance_device_function (it only handles 2 and 4 and
otherwise fails); fix by either removing/avoiding generating pq_len=8 in the
JSON or by extending add_compute_distance_device_function to accept pq_len==8
(add a branch/case that implements the 8-subvector path and update its planner
guard to allow 8), and ensure the planner guard logic in cagra_planner_base (the
same guard that currently restricts to {2,4}) is updated consistently so runtime
selection of the new variant will not hard-fail.

In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json`:
- Around line 148-153: The matrix variant adds pq_len="8", which will
deterministically fail because add_setup_workspace_device_function enforces
pq_len in {2,4} (see cagra_planner_base.hpp); fix by changing or removing that
variant so pq_len is one of the allowed values (e.g., set "pq_len": "4" or
delete the entry), ensuring setup_workspace matrix entries conform to
add_setup_workspace_device_function's contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 435ce792-2439-4aa7-87bd-cf2c83c5a341

📥 Commits

Reviewing files that changed from the base of the PR and between 819b621 and b116082.

📒 Files selected for processing (4)
  • cpp/src/neighbors/detail/cagra/compute_distance_vpq_matrix.json
  • cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json
  • cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json
  • cpp/src/neighbors/detail/nn_descent.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/neighbors/detail/nn_descent.cuh

Comment on lines 148 to 153
{
"pq_len": "8",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_8subd"
}

@coderabbitai coderabbitai Bot Jun 4, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

pq_len=8 in JIT compute_distance matrix conflicts with planner guard.

Line 149 introduces pq_len: "8", but add_compute_distance_device_function currently accepts only {2,4} and hard-fails otherwise (cagra_planner_base.hpp:116-145). This makes these generated variants crash at runtime when selected.

Suggested minimal fix
     "_pq": [
       {
         "pq_len": "2",
         "pq_bits": "8",
         "pq_prefix": "_vpq",
         "pq_suffix": "_8pq_2subd"
       },
       {
         "pq_len": "4",
         "pq_bits": "8",
         "pq_prefix": "_vpq",
         "pq_suffix": "_8pq_4subd"
-      },
-      {
-        "pq_len": "8",
-        "pq_bits": "8",
-        "pq_prefix": "_vpq",
-        "pq_suffix": "_8pq_8subd"
       }
     ],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{
"pq_len": "8",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_8subd"
}
"_pq": [
{
"pq_len": "2",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_2subd"
},
{
"pq_len": "4",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_4subd"
}
],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json`
around lines 148 - 153, The generated JIT variant uses pq_len="8" which is
unsupported by add_compute_distance_device_function (it only handles 2 and 4 and
otherwise fails); fix by either removing/avoiding generating pq_len=8 in the
JSON or by extending add_compute_distance_device_function to accept pq_len==8
(add a branch/case that implements the 8-subvector path and update its planner
guard to allow 8), and ensure the planner guard logic in cagra_planner_base (the
same guard that currently restricts to {2,4}) is updated consistently so runtime
selection of the new variant will not hard-fail.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these changes leaked from the same updates in #1533 accidentally?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh pr view 1533 --repo rapidsai/cuvs --json title,body,state,headRefName,baseRefName | jq '{title, state, headRefName, baseRefName}'
gh pr diff 1533 --repo rapidsai/cuvs -- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json 2>/dev/null | head -80

Length of output: 483



@achirkin Yes, that appears to be exactly the case. PR #1533 is titled "Improve CAGRA-Q performance and add support for PQ_LEN=8" (branch cagra-q-pq_len-8), which directly confirms your suspicion — the pq_len=8 entries in both compute_distance_matrix.json and setup_workspace_matrix.json are unrelated to this bug fix and should be reverted from this PR. The planner guard issue I flagged (only accepting {2,4}) would be addressed as part of #1533, not here.

Comment on lines 148 to 153
{
"pq_len": "8",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_8subd"
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

pq_len=8 in setup_workspace matrix violates runtime contract.

Line 149 adds pq_len: "8", but add_setup_workspace_device_function currently enforces pq_len in {2,4} and throws RAFT_FAIL otherwise (cagra_planner_base.hpp:39-89). This introduces a deterministic runtime failure for that variant.

Suggested minimal fix
     "_pq": [
       {
         "pq_len": "2",
         "pq_bits": "8",
         "pq_prefix": "_vpq",
         "pq_suffix": "_8pq_2subd"
       },
       {
         "pq_len": "4",
         "pq_bits": "8",
         "pq_prefix": "_vpq",
         "pq_suffix": "_8pq_4subd"
-      },
-      {
-        "pq_len": "8",
-        "pq_bits": "8",
-        "pq_prefix": "_vpq",
-        "pq_suffix": "_8pq_8subd"
       }
     ],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json`
around lines 148 - 153, The matrix variant adds pq_len="8", which will
deterministically fail because add_setup_workspace_device_function enforces
pq_len in {2,4} (see cagra_planner_base.hpp); fix by changing or removing that
variant so pq_len is one of the allowed values (e.g., set "pq_len": "4" or
delete the entry), ensuring setup_workspace matrix entries conform to
add_setup_workspace_device_function's contract.

@achirkin achirkin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and sorry you've had to keep the PR in sync with changes in cuvs for so long!

One note: could you please check one more time that all changes to the instantiation matrices are intentional?

@enp1s0

enp1s0 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@achirkin Thank you for approving the changes. Yes, the updates in #1533 were accidentally included here. I reverted the commit.

@tfeher tfeher left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @enp1s0 for the updates.

@tfeher tfeher removed request for a team June 16, 2026 07:42
@aamijar

aamijar commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

/ok to test 456d7a8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ci CMake cpp non-breaking Introduces a non-breaking change Python stale-active

Projects

None yet

Development

Successfully merging this pull request may close these issues.