[BUG] Fix CAGRA search recall with a graph built by NN Descent#819
[BUG] Fix CAGRA search recall with a graph built by NN Descent#819enp1s0 wants to merge 73 commits into
Conversation
ad5bf28 to
8a52d05
Compare
KyleFromNVIDIA
left a comment
There was a problem hiding this comment.
All CMake/CI/packaging changes have been removed, so approving.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json`:
- Around line 148-153: The generated JIT variant uses pq_len="8" which is
unsupported by add_compute_distance_device_function (it only handles 2 and 4 and
otherwise fails); fix by either removing/avoiding generating pq_len=8 in the
JSON or by extending add_compute_distance_device_function to accept pq_len==8
(add a branch/case that implements the 8-subvector path and update its planner
guard to allow 8), and ensure the planner guard logic in cagra_planner_base (the
same guard that currently restricts to {2,4}) is updated consistently so runtime
selection of the new variant will not hard-fail.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json`:
- Around line 148-153: The matrix variant adds pq_len="8", which will
deterministically fail because add_setup_workspace_device_function enforces
pq_len in {2,4} (see cagra_planner_base.hpp); fix by changing or removing that
variant so pq_len is one of the allowed values (e.g., set "pq_len": "4" or
delete the entry), ensuring setup_workspace matrix entries conform to
add_setup_workspace_device_function's contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 435ce792-2439-4aa7-87bd-cf2c83c5a341
📒 Files selected for processing (4)
cpp/src/neighbors/detail/cagra/compute_distance_vpq_matrix.jsoncpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.jsoncpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.jsoncpp/src/neighbors/detail/nn_descent.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/src/neighbors/detail/nn_descent.cuh
| { | ||
| "pq_len": "8", | ||
| "pq_bits": "8", | ||
| "pq_prefix": "_vpq", | ||
| "pq_suffix": "_8pq_8subd" | ||
| } |
There was a problem hiding this comment.
pq_len=8 in JIT compute_distance matrix conflicts with planner guard.
Line 149 introduces pq_len: "8", but add_compute_distance_device_function currently accepts only {2,4} and hard-fails otherwise (cagra_planner_base.hpp:116-145). This makes these generated variants crash at runtime when selected.
Suggested minimal fix
"_pq": [
{
"pq_len": "2",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_2subd"
},
{
"pq_len": "4",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_4subd"
- },
- {
- "pq_len": "8",
- "pq_bits": "8",
- "pq_prefix": "_vpq",
- "pq_suffix": "_8pq_8subd"
}
],📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| { | |
| "pq_len": "8", | |
| "pq_bits": "8", | |
| "pq_prefix": "_vpq", | |
| "pq_suffix": "_8pq_8subd" | |
| } | |
| "_pq": [ | |
| { | |
| "pq_len": "2", | |
| "pq_bits": "8", | |
| "pq_prefix": "_vpq", | |
| "pq_suffix": "_8pq_2subd" | |
| }, | |
| { | |
| "pq_len": "4", | |
| "pq_bits": "8", | |
| "pq_prefix": "_vpq", | |
| "pq_suffix": "_8pq_4subd" | |
| } | |
| ], |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json`
around lines 148 - 153, The generated JIT variant uses pq_len="8" which is
unsupported by add_compute_distance_device_function (it only handles 2 and 4 and
otherwise fails); fix by either removing/avoiding generating pq_len=8 in the
JSON or by extending add_compute_distance_device_function to accept pq_len==8
(add a branch/case that implements the 8-subvector path and update its planner
guard to allow 8), and ensure the planner guard logic in cagra_planner_base (the
same guard that currently restricts to {2,4}) is updated consistently so runtime
selection of the new variant will not hard-fail.
There was a problem hiding this comment.
I guess these changes leaked from the same updates in #1533 accidentally?
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
gh pr view 1533 --repo rapidsai/cuvs --json title,body,state,headRefName,baseRefName | jq '{title, state, headRefName, baseRefName}'
gh pr diff 1533 --repo rapidsai/cuvs -- cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json 2>/dev/null | head -80Length of output: 483
@achirkin Yes, that appears to be exactly the case. PR #1533 is titled "Improve CAGRA-Q performance and add support for PQ_LEN=8" (branch cagra-q-pq_len-8), which directly confirms your suspicion — the pq_len=8 entries in both compute_distance_matrix.json and setup_workspace_matrix.json are unrelated to this bug fix and should be reverted from this PR. The planner guard issue I flagged (only accepting {2,4}) would be addressed as part of #1533, not here.
| { | ||
| "pq_len": "8", | ||
| "pq_bits": "8", | ||
| "pq_prefix": "_vpq", | ||
| "pq_suffix": "_8pq_8subd" | ||
| } |
There was a problem hiding this comment.
pq_len=8 in setup_workspace matrix violates runtime contract.
Line 149 adds pq_len: "8", but add_setup_workspace_device_function currently enforces pq_len in {2,4} and throws RAFT_FAIL otherwise (cagra_planner_base.hpp:39-89). This introduces a deterministic runtime failure for that variant.
Suggested minimal fix
"_pq": [
{
"pq_len": "2",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_2subd"
},
{
"pq_len": "4",
"pq_bits": "8",
"pq_prefix": "_vpq",
"pq_suffix": "_8pq_4subd"
- },
- {
- "pq_len": "8",
- "pq_bits": "8",
- "pq_prefix": "_vpq",
- "pq_suffix": "_8pq_8subd"
}
],🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json`
around lines 148 - 153, The matrix variant adds pq_len="8", which will
deterministically fail because add_setup_workspace_device_function enforces
pq_len in {2,4} (see cagra_planner_base.hpp); fix by changing or removing that
variant so pq_len is one of the allowed values (e.g., set "pq_len": "4" or
delete the entry), ensuring setup_workspace matrix entries conform to
add_setup_workspace_device_function's contract.
achirkin
left a comment
There was a problem hiding this comment.
LGTM and sorry you've had to keep the PR in sync with changes in cuvs for so long!
One note: could you please check one more time that all changes to the instantiation matrices are intentional?
This reverts commit b4d348c.
|
/ok to test 456d7a8 |
This PR addresses an unexpected low recall issue in the CAGRA search with a graph generated by NN Descent.
For that, it updates the initial NN Descent graph generation so that all indices are included, whereas some nodes are not found in the branch-25.10 implementation.
(itopk=256)
This PR also makes the following changes: