Phase 4: Parquet I/O and crema convert CLI subcommand#29
Conversation
Adds Parquet as a first-class input/output format using pyarrow (pip install crema[fast], no new dependency). - crema/parsers/parquet.py: read_parquet() reads one or more Parquet files into a PsmDataset; reuses _convert_target_col from txt parser - crema/writers/parquet.py: to_parquet() writes confidence estimates as Parquet files; reuses _get_level_data from txt writer - crema/confidence.py: add to_parquet() method to Confidence base class - crema/crema.py: refactor CLI to use subcommands; assign-confidence gains --parquet flag for Parquet output; new convert subcommand reads any supported format and writes a single Parquet file - crema/params.py: restructured with add_subparsers(); graceful fallback for __version__ when package is not installed - crema/__init__.py: export read_parquet and to_parquet - tests/unit_tests/test_parquet.py: 16 tests covering read_parquet, to_parquet, and a txt/Parquet roundtrip comparison; auto-skipped when pyarrow is absent Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Adds Parquet input/output support and refactors the CLI into subcommands to enable faster workflows (including a new convert command and Parquet output option).
Changes:
- Introduces
crema.parsers.parquet.read_parquet()to load PSMs from Parquet intoPsmDataset. - Introduces
crema.writers.parquet.to_parquet()and addsConfidence.to_parquet()for Parquet result output. - Refactors CLI into
assign-confidenceandconvertsubcommands and updates argument parsing accordingly.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tests/unit_tests/test_parquet.py |
Adds unit tests covering Parquet read/write behaviors and roundtrips. |
crema/writers/parquet.py |
New Parquet writer for confidence estimates. |
crema/parsers/parquet.py |
New Parquet parser producing PsmDataset from Parquet files. |
crema/params.py |
Refactors CLI args into subcommands and adds Parquet/convert flags. |
crema/crema.py |
Implements subcommand routing, Parquet output flag, and convert command logic. |
crema/confidence.py |
Adds Confidence.to_parquet() wrapper. |
crema/__init__.py |
Exports read_parquet and to_parquet. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| conf = psms.assign_confidence( | ||
| score_column=args.score, | ||
| eval_fdr=args.eval_fdr, | ||
| method=args.method, | ||
| ) |
| if all(str(f).endswith(".parquet") for f in files): | ||
| raise ValueError( | ||
| "Parquet input requires explicit column arguments. " | ||
| "Use 'crema convert' output with 'crema assign-confidence'." | ||
| ) |
| help=( | ||
| "One or more collection of peptide-spectrum matches (PSMs) in the " | ||
| "mzTab, Tide tab-delimited formats, MSGF+ tsv file, MSAmanda csv " | ||
| "output, Morpheus txt output, or generic delimited text format." | ||
| "mzTab, Tide tab-delimited, MSGF+ tsv, MSAmanda csv, Morpheus txt, " | ||
| "generic delimited text, or Parquet format." | ||
| ), |
| help=( | ||
| "One or more PSM files to convert. Supported input formats: " | ||
| "Tide tab-delimited, MSGF+ tsv, MSAmanda csv, Comet, MSFragger, " | ||
| "generic delimited text, mzTab, pepXML." | ||
| ), |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crema/crema.py`:
- Around line 151-160: The PSM reading block in the convert function always uses
read_txt() regardless of input format, but the command's help text advertises
support for pepXML, mzTab, and other formats. Either implement format-specific
routing that detects the input file format and calls the appropriate reader (a
pepXML reader for .pepXML files, an mzTab reader for .mzTab files, and read_txt
for delimited text), or update the advertised help documentation to only claim
support for delimited text inputs. Ensure consistency between what the convert
command advertises as supported formats and what the actual PSM reading
implementation handles.
- Around line 140-145: The logging configuration is creating duplicate log
output by adding a redundant stream handler. The logging.basicConfig() call
already configures a stream handler to stderr when no filename is specified, so
the subsequent logging.getLogger().addHandler(logging.StreamHandler()) line is
unnecessary and causes duplication. Remove the line that calls
addHandler(logging.StreamHandler()) as it duplicates the handler already created
by basicConfig.
- Around line 121-125: The assign_confidence function call in crema.py is
missing four parameters that are parsed from CLI arguments but never forwarded.
Add the four missing parameters to the psms.assign_confidence call by including
threshold, pep_fdr_type, prot_fdr_type, and desc as keyword arguments, each
mapped from the corresponding args attribute (args.threshold, args.pep_fdr_type,
args.prot_fdr_type, and args.desc) to ensure user-provided CLI values are
properly used by the function.
In `@crema/params.py`:
- Around line 75-79: The help text for the PSM input parameter in
crema/params.py advertises Parquet format as a supported input type, but the
_run_assign_confidence() function in crema/crema.py rejects Parquet files with a
hard error, creating a contradiction between advertised and actual
functionality. Remove "or Parquet format" from the help string (the section in
the diff starting with "One or more collection of peptide-spectrum matches...")
to align the CLI documentation with the actual runtime behavior.
In `@crema/parsers/parquet.py`:
- Around line 84-93: The PsmDataset constructor call in the function containing
this code block hardcodes copy_data=False, ignoring any copy_data parameter that
callers may pass. Modify the function signature to accept a copy_data parameter
(with an appropriate default value) and then pass that parameter to the
PsmDataset constructor instead of the hardcoded False value. This will allow
callers to control whether data is deep-copied as documented.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 315cd4d2-8ae2-470d-83e2-73e14cfdd5c9
📒 Files selected for processing (7)
crema/__init__.pycrema/confidence.pycrema/crema.pycrema/params.pycrema/parsers/parquet.pycrema/writers/parquet.pytests/unit_tests/test_parquet.py
- crema/crema.py: forward threshold, pep_fdr_type, prot_fdr_type, and desc from CLI args to assign_confidence(); remove duplicate stream handler in _run_convert (basicConfig already sets one up) - crema/params.py: remove "Parquet format" from assign-confidence input help (Parquet input is not supported); narrow convert help text to "delimited text" only, matching what read_txt() actually handles - crema/parsers/parquet.py: pass copy_data parameter through to PsmDataset instead of hardcoding False Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
crema/crema.py:88
_auto_read()does not includeread_txtin its parser fallback list, even thoughassign-confidencehelp text says it supports “generic delimited text format”. As a result, generic delimited inputs that previously worked will now raiseValueError("Unrecognized file type.").
readers = [
read_tide,
read_msgf,
read_msamanda,
read_comet,
read_msfragger,
read_pepxml,
read_mztab,
]
| if all(str(f).endswith(".parquet") for f in files): | ||
| raise ValueError( | ||
| "Parquet input requires explicit column arguments. " | ||
| "Use 'crema convert' output with 'crema assign-confidence'." | ||
| ) |
| ac.add_argument( | ||
| "-t", | ||
| "--threshold", | ||
| type=float, | ||
| default=0.01, | ||
| help=( | ||
| "The FDR threshold for accepting discoveries. Default is 0.01. " | ||
| "If 'q-value' is chosen, then “ accept” column is replaced " | ||
| "If 'q-value' is chosen, then 'accept' column is replaced " | ||
| "with 'crema q-value'." | ||
| ), | ||
| ) |
| logging.info("Reading PSMs...") | ||
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, | ||
| peptide_column=args.peptide_column, | ||
| protein_column=args.protein_column, | ||
| protein_delim=args.protein_delim, | ||
| ) |
| out_files = [] | ||
| for level, qval_list in results.items(): | ||
| out_file = str(file_base) + f".{level}.parquet" | ||
| pd.concat(qval_list).to_parquet(out_file, index=False) | ||
| out_files.append(out_file) |
The _auto_read() refactor accidentally dropped read_txt from the readers list, breaking generic delimited text input to assign-confidence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
crema/params.py:88
--scoreis defined withnargs='+', so argparse returns a list (e.g.,['x']). That list is passed through toPsmDataset.assign_confidence(score_column=...), but the confidence code expects a single column name (string); passing a list can break q-value calculation whendesc=None(default).
ac.add_argument(
"-s",
"--score",
type=str,
nargs="+",
default=None,
help=(
| ac.add_argument( | ||
| "-t", | ||
| "--threshold", | ||
| type=float, | ||
| default=0.01, | ||
| help=( | ||
| "The FDR threshold for accepting discoveries. Default is 0.01. " | ||
| "If 'q-value' is chosen, then “ accept” column is replaced " | ||
| "If 'q-value' is chosen, then 'accept' column is replaced " | ||
| "with 'crema q-value'." | ||
| ), |
| if all(str(f).endswith(".parquet") for f in files): | ||
| raise ValueError( | ||
| "Parquet input requires explicit column arguments. " | ||
| "Use 'crema convert' output with 'crema assign-confidence'." | ||
| ) |
| logging.info("Writing Parquet to %s...", out_path) | ||
| psms.data.to_parquet(out_path, index=False) |
args.score is a list (nargs='+') but assign_confidence expects a string or None. Unwrap single-element lists to the string value; treat multi-element or absent lists as None so crema auto-selects the best score column. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
crema/params.py:86
--scoreis declared as accepting one or more values, and the help text says the “best will be selected automatically” among the provided columns. However, the CLI implementation only supports a single score string (and otherwise triggers global auto-selection), so passing multiple values is misleading and not supported as documented.
Consider restricting this CLI flag to a single column (no nargs) and updating the help text accordingly (users can omit --score to let crema choose the best score automatically).
ac.add_argument(
"-s",
"--score",
type=str,
nargs="+",
| # args.score is None or a list from nargs='+'; assign_confidence expects | ||
| # a single string or None (None triggers automatic best-score selection). | ||
| score = args.score[0] if args.score and len(args.score) == 1 else None | ||
|
|
| if all(str(f).endswith(".parquet") for f in files): | ||
| raise ValueError( | ||
| "Parquet input requires explicit column arguments. " | ||
| "Use 'crema convert' output with 'crema assign-confidence'." | ||
| ) |
| logging.info("Writing Parquet to %s...", out_path) | ||
| psms.data.to_parquet(out_path, index=False) |
| logging.info("Reading PSMs...") | ||
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, | ||
| peptide_column=args.peptide_column, | ||
| protein_column=args.protein_column, | ||
| protein_delim=args.protein_delim, | ||
| ) |
- Fix misleading error message in _auto_read(): Parquet guidance now points to crema.read_parquet() in Python, not the CLI - Simplify --score to single-value arg (nargs removed); the underlying API takes one score_column, so nargs='+' was misleading - Remove q-value mention from --threshold help (type is float; q-value mode is Python-API-only) - Add explicit pyarrow ImportError with install hint in _run_convert() before writing Parquet, matching the rest of the Parquet I/O paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crema/params.py (1)
57-178:⚠️ Potential issue | 🟠 Major | ⚡ Quick winLegacy CLI invocation is broken by mandatory subcommand parsing.
Line 57 moves parsing to subcommands, and Lines 71-178 relocate
assign-confidenceargs under that subparser only. Existing calls likecrema --output_dir ... -e ... file.txtnow fail argument parsing (before runtime fallback), even though current system usage still relies on that form. Please add a backward-compatible fallback that maps “no explicit command” invocations toassign-confidencebefore full parse.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crema/params.py` around lines 57 - 178, The refactoring to mandatory subcommand parsing with add_subparsers() has broken backward compatibility for legacy invocations that omit the explicit subcommand. To fix this, add logic that intercepts the command-line arguments before the main parser's parse_args() call and automatically prepends "assign-confidence" to the arguments list when no recognized subcommand is detected at the start of the arguments. This allows the legacy invocation style (crema --output_dir ... -e ... file.txt) to be transparently converted to the new format (crema assign-confidence --output_dir ... -e ... file.txt) before parsing, maintaining backward compatibility while supporting the new subcommand structure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crema/params.py`:
- Around line 57-178: The refactoring to mandatory subcommand parsing with
add_subparsers() has broken backward compatibility for legacy invocations that
omit the explicit subcommand. To fix this, add logic that intercepts the
command-line arguments before the main parser's parse_args() call and
automatically prepends "assign-confidence" to the arguments list when no
recognized subcommand is detected at the start of the arguments. This allows the
legacy invocation style (crema --output_dir ... -e ... file.txt) to be
transparently converted to the new format (crema assign-confidence --output_dir
... -e ... file.txt) before parsing, maintaining backward compatibility while
supporting the new subcommand structure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4fe5bdd6-4fe8-46b8-8785-d2e0a49fd493
📒 Files selected for processing (3)
crema/crema.pycrema/params.pycrema/parsers/parquet.py
🚧 Files skipped from review as they are similar to previous changes (2)
- crema/parsers/parquet.py
- crema/crema.py
| try: | ||
| assert not isinstance(conf, str) | ||
| iter(conf) | ||
| except TypeError: | ||
| conf = [conf] | ||
| except AssertionError: | ||
| raise ValueError("'conf' should be a Confidence object, not a string.") |
| conf.to_parquet(output_dir=tmp_path) | ||
| assert (tmp_path / "crema.psms.parquet").exists() | ||
| assert (tmp_path / "crema.peptides.parquet").exists() | ||
| assert (tmp_path / "crema.proteins.parquet").exists() | ||
|
|
| conf.to_parquet(output_dir=tmp_path, decoys=True) | ||
| assert (tmp_path / "crema.decoy.psms.parquet").exists() | ||
|
|
| logging.info("Reading PSMs...") | ||
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, |
Legacy crema invocations (crema file.txt ...) now transparently map to (crema assign-confidence file.txt ...) by injecting the subcommand when no recognized subcommand is present in argv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace unsafe assert-based string check with explicit isinstance in to_parquet() — assert is stripped with python -O - Add protein_groups assertions to to_parquet file-creation tests - Add decoy peptides/proteins assertions to decoys=True test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| argv = sys.argv[1:] | ||
| if argv and argv[0] not in ("assign-confidence", "convert"): | ||
| sys.argv.insert(1, "assign-confidence") |
| try: | ||
| assert not isinstance(conf, str) | ||
| iter(conf) | ||
| except TypeError: | ||
| conf = [conf] | ||
| except AssertionError: | ||
| raise ValueError("'conf' should be a Confidence object, not a string.") |
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, |
crema --help was being rewritten to crema assign-confidence --help, showing subcommand help instead of the top-level help. Exclude help and version flags from the subcommand injection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| # Backward compatibility: if the first argument is not a known | ||
| # subcommand, default to 'assign-confidence' so legacy invocations | ||
| # like ``crema file.txt ...`` continue to work. | ||
| argv = sys.argv[1:] | ||
| if argv and argv[0] not in ("assign-confidence", "convert"): | ||
| sys.argv.insert(1, "assign-confidence") |
| ac.add_argument( | ||
| "-s", | ||
| "--score", | ||
| type=str, | ||
| nargs="+", | ||
| default=None, | ||
| help=( | ||
| "One or more columns that indicate possible scores by which to " | ||
| "rank the PSMs. If more than one is provided, the best will be " | ||
| "selected automatically. If none are provided, crema will try all " | ||
| "available scores." | ||
| "The column to use as the PSM score. If not provided, crema will " | ||
| "try all available scores and select the best one automatically." | ||
| ), | ||
| ) |
| logging.info("Reading PSMs...") | ||
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, | ||
| peptide_column=args.peptide_column, | ||
| protein_column=args.protein_column, | ||
| protein_delim=args.protein_delim, | ||
| ) |
| out_files = [] | ||
| for level, qval_list in results.items(): | ||
| out_file = str(file_base) + f".{level}.parquet" | ||
| pd.concat(qval_list).to_parquet(out_file, index=False) | ||
| out_files.append(out_file) |
Replace the explicit help-flag allowlist with a startswith('-') check,
so all option flags (not just -h/--help/--version) leave the top-level
parser untouched.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| parser = argparse.ArgumentParser( | ||
| description=desc, formatter_class=CremaHelpFormatter | ||
| ) |
| logging.info("Reading PSMs...") | ||
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, | ||
| peptide_column=args.peptide_column, | ||
| protein_column=args.protein_column, | ||
| protein_delim=args.protein_delim, | ||
| ) |
| ac.add_argument( | ||
| "-p", | ||
| "--pep_fdr_type", | ||
| type=str, | ||
| default="psm-only", | ||
| choices=["psm-only", "peptide-only", "psm-peptide"], | ||
| help="The peptide-level FDR estimation method to use." | ||
| "Default is 'psm-peptide'", | ||
| help="The peptide-level FDR estimation method to use. " | ||
| "Default is 'psm-only'", | ||
| ) |
| logging.info("Reading PSMs...") | ||
| psms = read_txt( | ||
| args.psm_files, | ||
| target_column=args.target_column, | ||
| spectrum_columns=args.spectrum_columns, | ||
| score_columns=args.score_columns, | ||
| peptide_column=args.peptide_column, | ||
| protein_column=args.protein_column, | ||
| protein_delim=args.protein_delim, | ||
| ) |
Summary
crema/parsers/parquet.py:read_parquet()reads one or more Parquet files into aPsmDataset; reuses_convert_target_colfrom the txt parser for target/decoy column handlingcrema/writers/parquet.py:to_parquet()writes confidence estimates to Parquet files per level (PSMs, peptides, proteins); reuses_get_level_datafrom txt writercrema/confidence.py:to_parquet()method added toConfidencebase classcrema/crema.py: CLI refactored to use subcommands;assign-confidence(existing behaviour, gains--parquetoutput flag); newconvertsubcommand reads any supported format and writes a Parquet filecrema/params.py: restructured withadd_subparsers(); graceful__version__fallback when package is not pip-installedcrema/__init__.py: exportsread_parquetandto_parquettests/unit_tests/test_parquet.py: 16 tests coveringread_parquet,to_parquet, missing-pyarrow error paths, and a txt/Parquet roundtrip score comparison; auto-skipped when pyarrow is absent[fast]from Phase 1Test plan
pytest tests/unit_tests/passes (Parquet tests skipped if pyarrow absent, pass if installed)pip install crema[fast]— Parquet tests run and passcrema --helpshowsassign-confidenceandconvertsubcommandscrema convert --helpshows expected argumentscrema assign-confidence --parquet ...writes.parquetoutput filescrema convert --target-column ... input.txtproduces a valid Parquet file readable byread_parquet()🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
read_parquet()function to load PSM data from Parquet filesto_parquet()method to export confidence estimates to Parquet formatconvertsubcommand converts delimited text PSM data to Parquetassign-confidencesupports--parquetflag to output results directly to ParquetTests