Phase 3: DuckDB backend for out-of-core TDC confidence estimation#28
Phase 3: DuckDB backend for out-of-core TDC confidence estimation#28wsnoble wants to merge 3 commits into
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Adds an explicit opt-in backend='duckdb' to assign_confidence() that implements the full TDC pipeline (competition, FDR, q-values) using DuckDB SQL window functions, enabling datasets larger than RAM. - crema/backends/duckdb.py: new module with _compete_into(), _fetch_qvalues(), and run_tdc(); the q-value SQL uses the same tied-score-aware backward-monotone-minimum logic as qvalues.tdc() - crema/confidence.py: add DuckdbTdcConfidence class (subclass of Confidence) and backend param to module-level assign_confidence() - crema/dataset.py: add backend param to PsmDataset.assign_confidence(); routes backend='duckdb' to DuckdbTdcConfidence, raises ValueError for incompatible method='mixmax' - crema/__init__.py: export DuckdbTdcConfidence - setup.cfg: add [large] extras_require with duckdb>=0.8.0; add [fast] extras for pyarrow>=8.0.0 - tests/unit_tests/test_duckdb.py: 18 tests covering API, output structure, numerical agreement with pandas path, and to_txt compat; all skipped automatically when duckdb is not installed Protein-group estimation is not implemented in the DuckDB backend. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crema/backends/duckdb.py`:
- Around line 50-64: The CREATE TABLE SQL statement in the con.execute call and
other code blocks in the file do not conform to Black's formatting standards,
causing linting failures. Run the Black code formatter on the
crema/backends/duckdb.py file to automatically apply the correct formatting to
the SQL query string and all other affected code blocks referenced in the
comment (lines 95-127, 236-244, 267-273, 284-295). This will ensure the code
passes the CI linting checks.
- Around line 49-60: Define a helper function _qid that takes a string name and
returns it wrapped in double quotes with any internal double quotes escaped by
doubling them (quote becomes two quotes). Then systematically replace all direct
identifier interpolations throughout the file that currently use the pattern
"{variable_name}" with _qid(variable_name) instead. This includes the partition
variable construction on line 49, all score_col variable references, target_col
references, pep_col references, prot_col references, and score_column references
across the various SQL query construction sections in the file. The goal is to
ensure that column names containing double quote characters are properly escaped
and do not cause SQL parse failures.
In `@crema/dataset.py`:
- Around line 218-233: The current code silently falls back to pandas when an
unrecognized backend value is provided (e.g., a typo like "duckbd"), which can
mask configuration mistakes. After the if statement that checks for backend ==
"duckdb", replace the else clause with explicit validation that either accepts
known backend values like "pandas" or raises a ValueError rejecting unknown
backends. This ensures invalid backend values are caught immediately with a
clear error message rather than silently executing with an unintended backend.
In `@tests/unit_tests/test_duckdb.py`:
- Line 12: Apply Black code formatting to the pytest.importorskip call on line
12. The line exceeds Black's default line length limit and needs to be
reformatted, likely by breaking the pytest.importorskip function call with the
"duckdb" argument and reason parameter across multiple lines to meet Black's
formatting standards and unblock the CI checks.
- Around line 181-201: The test test_duckdb_psm_qvalues_match_pandas is
comparing the score column "x" instead of the actual q-values. Replace the
assertion that currently compares p_df["x"] and d_df["x"] with a comparison of
the appropriate q-value column from both dataframes. The test should validate
that the q-values calculated by the DuckDB path match those from the pandas
path, not the original scores. Identify the correct q-value column name from the
confidence_estimates dictionary and update the pd.testing.assert_series_equal
call accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c4e1a4e-45b5-4cca-9c10-ee6598ff06a9
📒 Files selected for processing (7)
crema/__init__.pycrema/backends/__init__.pycrema/backends/duckdb.pycrema/confidence.pycrema/dataset.pysetup.cfgtests/unit_tests/test_duckdb.py
- crema/backends/duckdb.py: reformat multiline f-string calls to match Black 26.x style - tests/unit_tests/test_duckdb.py: same Black reformatting; remove noqa: E402 directives (E402 is not in ruff select, so they triggered RUF100 unused-noqa warnings) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- backends/duckdb.py: add _qid() helper to safely escape double-quote
characters in column name identifiers; replace all inline f'"{col}"'
interpolations throughout SQL queries
- dataset.py: replace silent else fallback with explicit elif 'pandas'
+ else raise ValueError for unknown backend values
- tests/unit_tests/test_duckdb.py: fix test_duckdb_psm_qvalues_match_pandas
to compare 'crema q-value' column instead of the raw score column
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
backend='duckdb'parameter toPsmDataset.assign_confidence()and the module-levelassign_confidence()— explicit opt-in, no behavior change for existing codecrema/backends/duckdb.pyimplements the full TDC pipeline (spectrum competition, peptide competition, protein aggregation, FDR + q-values) using DuckDB SQL window functions; handles datasets larger than RAM via out-of-core executionqvalues.tdc()for numerical equivalenceDuckdbTdcConfidenceclass is a drop-in subclass ofConfidence;to_txt()and_prettify_tables()work unchangedsetup.cfg: adds[fast](pyarrow) and[large](duckdb≥0.8.0) optional dependency groupstests/unit_tests/test_duckdb.py; all skipped automatically when duckdb is not installedLimitations
method='tdc';backend='duckdb'withmethod='mixmax'raisesValueErrorTest plan
pytest tests/unit_tests/passes with 91 tests (duckdb tests skipped if package absent)pip install crema[large]installs duckdb; duckdb tests then run and passpsms.assign_confidence(score_column="score", backend="duckdb")produces same accept counts as pandas pathto_txt()on aDuckdbTdcConfidenceobject writes valid output files🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
backendparameter to confidence assignment functions (default:"pandas").largefor DuckDB support.Tests