Phase 4: Parquet I/O and crema convert CLI subcommand by wsnoble · Pull Request #29 · Noble-Lab/crema

wsnoble · 2026-06-20T06:37:33Z

Summary

New crema/parsers/parquet.py: read_parquet() reads one or more Parquet files into a PsmDataset; reuses _convert_target_col from the txt parser for target/decoy column handling
New crema/writers/parquet.py: to_parquet() writes confidence estimates to Parquet files per level (PSMs, peptides, proteins); reuses _get_level_data from txt writer
crema/confidence.py: to_parquet() method added to Confidence base class
crema/crema.py: CLI refactored to use subcommands; assign-confidence (existing behaviour, gains --parquet output flag); new convert subcommand reads any supported format and writes a Parquet file
crema/params.py: restructured with add_subparsers(); graceful __version__ fallback when package is not pip-installed
crema/__init__.py: exports read_parquet and to_parquet
tests/unit_tests/test_parquet.py: 16 tests covering read_parquet, to_parquet, missing-pyarrow error paths, and a txt/Parquet roundtrip score comparison; auto-skipped when pyarrow is absent
No new dependencies: pyarrow is already in [fast] from Phase 1

Test plan

pytest tests/unit_tests/ passes (Parquet tests skipped if pyarrow absent, pass if installed)
pip install crema[fast] — Parquet tests run and pass
crema --help shows assign-confidence and convert subcommands
crema convert --help shows expected arguments
crema assign-confidence --parquet ... writes .parquet output files
crema convert --target-column ... input.txt produces a valid Parquet file readable by read_parquet()

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added Parquet format support for reading and writing PSM data
- Introduced read_parquet() function to load PSM data from Parquet files
- Added to_parquet() method to export confidence estimates to Parquet format
- CLI now organized with subcommands; new convert subcommand converts delimited text PSM data to Parquet
- assign-confidence supports --parquet flag to output results directly to Parquet
Tests
- Comprehensive test coverage added for Parquet I/O operations

Adds Parquet as a first-class input/output format using pyarrow (pip install crema[fast], no new dependency). - crema/parsers/parquet.py: read_parquet() reads one or more Parquet files into a PsmDataset; reuses _convert_target_col from txt parser - crema/writers/parquet.py: to_parquet() writes confidence estimates as Parquet files; reuses _get_level_data from txt writer - crema/confidence.py: add to_parquet() method to Confidence base class - crema/crema.py: refactor CLI to use subcommands; assign-confidence gains --parquet flag for Parquet output; new convert subcommand reads any supported format and writes a single Parquet file - crema/params.py: restructured with add_subparsers(); graceful fallback for __version__ when package is not installed - crema/__init__.py: export read_parquet and to_parquet - tests/unit_tests/test_parquet.py: 16 tests covering read_parquet, to_parquet, and a txt/Parquet roundtrip comparison; auto-skipped when pyarrow is absent Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-20T06:37:41Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb8efd29-1439-4a33-a930-b0fead738125

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch scale-phase-4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Adds Parquet input/output support and refactors the CLI into subcommands to enable faster workflows (including a new convert command and Parquet output option).

Changes:

Introduces crema.parsers.parquet.read_parquet() to load PSMs from Parquet into PsmDataset.
Introduces crema.writers.parquet.to_parquet() and adds Confidence.to_parquet() for Parquet result output.
Refactors CLI into assign-confidence and convert subcommands and updates argument parsing accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`tests/unit_tests/test_parquet.py`	Adds unit tests covering Parquet read/write behaviors and roundtrips.
`crema/writers/parquet.py`	New Parquet writer for confidence estimates.
`crema/parsers/parquet.py`	New Parquet parser producing `PsmDataset` from Parquet files.
`crema/params.py`	Refactors CLI args into subcommands and adds Parquet/convert flags.
`crema/crema.py`	Implements subcommand routing, Parquet output flag, and convert command logic.
`crema/confidence.py`	Adds `Confidence.to_parquet()` wrapper.
`crema/__init__.py`	Exports `read_parquet` and `to_parquet`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

    conf = psms.assign_confidence(
        score_column=args.score,
        eval_fdr=args.eval_fdr,
        method=args.method,
    )


+    if all(str(f).endswith(".parquet") for f in files):
+        raise ValueError(
+            "Parquet input requires explicit column arguments. "
+            "Use 'crema convert' output with 'crema assign-confidence'."
+        )


        help=(
            "One or more collection of peptide-spectrum matches (PSMs) in the "
-            "mzTab, Tide tab-delimited formats, MSGF+ tsv file, MSAmanda csv "
-            "output, Morpheus txt output, or generic delimited text format."
+            "mzTab, Tide tab-delimited, MSGF+ tsv, MSAmanda csv, Morpheus txt, "
+            "generic delimited text, or Parquet format."
        ),


+        help=(
+            "One or more PSM files to convert. Supported input formats: "
+            "Tide tab-delimited, MSGF+ tsv, MSAmanda csv, Comet, MSFragger, "
+            "generic delimited text, mzTab, pepXML."
+        ),


coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crema/crema.py`:
- Around line 151-160: The PSM reading block in the convert function always uses
read_txt() regardless of input format, but the command's help text advertises
support for pepXML, mzTab, and other formats. Either implement format-specific
routing that detects the input file format and calls the appropriate reader (a
pepXML reader for .pepXML files, an mzTab reader for .mzTab files, and read_txt
for delimited text), or update the advertised help documentation to only claim
support for delimited text inputs. Ensure consistency between what the convert
command advertises as supported formats and what the actual PSM reading
implementation handles.
- Around line 140-145: The logging configuration is creating duplicate log
output by adding a redundant stream handler. The logging.basicConfig() call
already configures a stream handler to stderr when no filename is specified, so
the subsequent logging.getLogger().addHandler(logging.StreamHandler()) line is
unnecessary and causes duplication. Remove the line that calls
addHandler(logging.StreamHandler()) as it duplicates the handler already created
by basicConfig.
- Around line 121-125: The assign_confidence function call in crema.py is
missing four parameters that are parsed from CLI arguments but never forwarded.
Add the four missing parameters to the psms.assign_confidence call by including
threshold, pep_fdr_type, prot_fdr_type, and desc as keyword arguments, each
mapped from the corresponding args attribute (args.threshold, args.pep_fdr_type,
args.prot_fdr_type, and args.desc) to ensure user-provided CLI values are
properly used by the function.

In `@crema/params.py`:
- Around line 75-79: The help text for the PSM input parameter in
crema/params.py advertises Parquet format as a supported input type, but the
_run_assign_confidence() function in crema/crema.py rejects Parquet files with a
hard error, creating a contradiction between advertised and actual
functionality. Remove "or Parquet format" from the help string (the section in
the diff starting with "One or more collection of peptide-spectrum matches...")
to align the CLI documentation with the actual runtime behavior.

In `@crema/parsers/parquet.py`:
- Around line 84-93: The PsmDataset constructor call in the function containing
this code block hardcodes copy_data=False, ignoring any copy_data parameter that
callers may pass. Modify the function signature to accept a copy_data parameter
(with an appropriate default value) and then pass that parameter to the
PsmDataset constructor instead of the hardcoded False value. This will allow
callers to control whether data is deep-copied as documented.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 315cd4d2-8ae2-470d-83e2-73e14cfdd5c9

📥 Commits

Reviewing files that changed from the base of the PR and between d8432ee and 3834de4.

📒 Files selected for processing (7)

crema/__init__.py
crema/confidence.py
crema/crema.py
crema/params.py
crema/parsers/parquet.py
crema/writers/parquet.py
tests/unit_tests/test_parquet.py

- crema/crema.py: forward threshold, pep_fdr_type, prot_fdr_type, and desc from CLI args to assign_confidence(); remove duplicate stream handler in _run_convert (basicConfig already sets one up) - crema/params.py: remove "Parquet format" from assign-confidence input help (Parquet input is not supported); narrow convert help text to "delimited text" only, matching what read_txt() actually handles - crema/parsers/parquet.py: pass copy_data parameter through to PsmDataset instead of hardcoding False Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

crema/crema.py:88

_auto_read() does not include read_txt in its parser fallback list, even though assign-confidence help text says it supports “generic delimited text format”. As a result, generic delimited inputs that previously worked will now raise ValueError("Unrecognized file type.").

    readers = [
        read_tide,
        read_msgf,
        read_msamanda,
        read_comet,
        read_msfragger,
        read_pepxml,
        read_mztab,
    ]

+    if all(str(f).endswith(".parquet") for f in files):
+        raise ValueError(
+            "Parquet input requires explicit column arguments. "
+            "Use 'crema convert' output with 'crema assign-confidence'."
+        )


+    ac.add_argument(
        "-t",
        "--threshold",
        type=float,
        default=0.01,
        help=(
            "The FDR threshold for accepting discoveries. Default is 0.01. "
-            "If 'q-value' is chosen, then “ accept”  column is replaced "
+            "If 'q-value' is chosen, then 'accept' column is replaced "
            "with 'crema q-value'."
        ),
    )


+    logging.info("Reading PSMs...")
+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,
+        peptide_column=args.peptide_column,
+        protein_column=args.protein_column,
+        protein_delim=args.protein_delim,
+    )


+    out_files = []
+    for level, qval_list in results.items():
+        out_file = str(file_base) + f".{level}.parquet"
+        pd.concat(qval_list).to_parquet(out_file, index=False)
+        out_files.append(out_file)


The _auto_read() refactor accidentally dropped read_txt from the readers list, breaking generic delimited text input to assign-confidence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

crema/params.py:88

--score is defined with nargs='+', so argparse returns a list (e.g., ['x']). That list is passed through to PsmDataset.assign_confidence(score_column=...), but the confidence code expects a single column name (string); passing a list can break q-value calculation when desc=None (default).

    ac.add_argument(
        "-s",
        "--score",
        type=str,
        nargs="+",
        default=None,
        help=(

+    ac.add_argument(
        "-t",
        "--threshold",
        type=float,
        default=0.01,
        help=(
            "The FDR threshold for accepting discoveries. Default is 0.01. "
-            "If 'q-value' is chosen, then “ accept”  column is replaced "
+            "If 'q-value' is chosen, then 'accept' column is replaced "
            "with 'crema q-value'."
        ),


+    if all(str(f).endswith(".parquet") for f in files):
+        raise ValueError(
+            "Parquet input requires explicit column arguments. "
+            "Use 'crema convert' output with 'crema assign-confidence'."
+        )


+    logging.info("Writing Parquet to %s...", out_path)
+    psms.data.to_parquet(out_path, index=False)


args.score is a list (nargs='+') but assign_confidence expects a string or None. Unwrap single-element lists to the string value; treat multi-element or absent lists as None so crema auto-selects the best score column. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

crema/params.py:86

--score is declared as accepting one or more values, and the help text says the “best will be selected automatically” among the provided columns. However, the CLI implementation only supports a single score string (and otherwise triggers global auto-selection), so passing multiple values is misleading and not supported as documented.

Consider restricting this CLI flag to a single column (no nargs) and updating the help text accordingly (users can omit --score to let crema choose the best score automatically).

    ac.add_argument(
        "-s",
        "--score",
        type=str,
        nargs="+",

+    # args.score is None or a list from nargs='+'; assign_confidence expects
+    # a single string or None (None triggers automatic best-score selection).
+    score = args.score[0] if args.score and len(args.score) == 1 else None
+


+    if all(str(f).endswith(".parquet") for f in files):
+        raise ValueError(
+            "Parquet input requires explicit column arguments. "
+            "Use 'crema convert' output with 'crema assign-confidence'."
+        )


+    logging.info("Writing Parquet to %s...", out_path)
+    psms.data.to_parquet(out_path, index=False)


+    logging.info("Reading PSMs...")
+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,
+        peptide_column=args.peptide_column,
+        protein_column=args.protein_column,
+        protein_delim=args.protein_delim,
+    )


- Fix misleading error message in _auto_read(): Parquet guidance now points to crema.read_parquet() in Python, not the CLI - Simplify --score to single-value arg (nargs removed); the underlying API takes one score_column, so nargs='+' was misleading - Remove q-value mention from --threshold help (type is float; q-value mode is Python-API-only) - Add explicit pyarrow ImportError with install hint in _run_convert() before writing Parquet, matching the rest of the Parquet I/O paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crema/params.py (1)
57-178: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Legacy CLI invocation is broken by mandatory subcommand parsing.

Line 57 moves parsing to subcommands, and Lines 71-178 relocate assign-confidence args under that subparser only. Existing calls like crema --output_dir ... -e ... file.txt now fail argument parsing (before runtime fallback), even though current system usage still relies on that form. Please add a backward-compatible fallback that maps “no explicit command” invocations to assign-confidence before full parse.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crema/params.py` around lines 57 - 178, The refactoring to mandatory
subcommand parsing with add_subparsers() has broken backward compatibility for
legacy invocations that omit the explicit subcommand. To fix this, add logic
that intercepts the command-line arguments before the main parser's parse_args()
call and automatically prepends "assign-confidence" to the arguments list when
no recognized subcommand is detected at the start of the arguments. This allows
the legacy invocation style (crema --output_dir ... -e ... file.txt) to be
transparently converted to the new format (crema assign-confidence --output_dir
... -e ... file.txt) before parsing, maintaining backward compatibility while
supporting the new subcommand structure.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crema/params.py`:
- Around line 57-178: The refactoring to mandatory subcommand parsing with
add_subparsers() has broken backward compatibility for legacy invocations that
omit the explicit subcommand. To fix this, add logic that intercepts the
command-line arguments before the main parser's parse_args() call and
automatically prepends "assign-confidence" to the arguments list when no
recognized subcommand is detected at the start of the arguments. This allows the
legacy invocation style (crema --output_dir ... -e ... file.txt) to be
transparently converted to the new format (crema assign-confidence --output_dir
... -e ... file.txt) before parsing, maintaining backward compatibility while
supporting the new subcommand structure.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4fe5bdd6-4fe8-46b8-8785-d2e0a49fd493

📥 Commits

Reviewing files that changed from the base of the PR and between 3834de4 and fe15da8.

📒 Files selected for processing (3)

crema/crema.py
crema/params.py
crema/parsers/parquet.py

🚧 Files skipped from review as they are similar to previous changes (2)

crema/parsers/parquet.py
crema/crema.py

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

+    try:
+        assert not isinstance(conf, str)
+        iter(conf)
+    except TypeError:
+        conf = [conf]
+    except AssertionError:
+        raise ValueError("'conf' should be a Confidence object, not a string.")


+    conf.to_parquet(output_dir=tmp_path)
+    assert (tmp_path / "crema.psms.parquet").exists()
+    assert (tmp_path / "crema.peptides.parquet").exists()
+    assert (tmp_path / "crema.proteins.parquet").exists()
+


+    conf.to_parquet(output_dir=tmp_path, decoys=True)
+    assert (tmp_path / "crema.decoy.psms.parquet").exists()
+


+    logging.info("Reading PSMs...")
+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,


Legacy crema invocations (crema file.txt ...) now transparently map to (crema assign-confidence file.txt ...) by injecting the subcommand when no recognized subcommand is present in argv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Replace unsafe assert-based string check with explicit isinstance in to_parquet() — assert is stripped with python -O - Add protein_groups assertions to to_parquet file-creation tests - Add decoy peptides/proteins assertions to decoys=True test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

+        argv = sys.argv[1:]
+        if argv and argv[0] not in ("assign-confidence", "convert"):
+            sys.argv.insert(1, "assign-confidence")


+    try:
+        assert not isinstance(conf, str)
+        iter(conf)
+    except TypeError:
+        conf = [conf]
+    except AssertionError:
+        raise ValueError("'conf' should be a Confidence object, not a string.")


+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,


crema --help was being rewritten to crema assign-confidence --help, showing subcommand help instead of the top-level help. Exclude help and version flags from the subcommand injection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

+        # Backward compatibility: if the first argument is not a known
+        # subcommand, default to 'assign-confidence' so legacy invocations
+        # like ``crema file.txt ...`` continue to work.
+        argv = sys.argv[1:]
+        if argv and argv[0] not in ("assign-confidence", "convert"):
+            sys.argv.insert(1, "assign-confidence")


+    ac.add_argument(
        "-s",
        "--score",
        type=str,
-        nargs="+",
        default=None,
        help=(
-            "One or more columns that indicate possible scores by which to "
-            "rank the PSMs. If more than one is provided, the best will be "
-            "selected automatically. If none are provided, crema will try all "
-            "available scores."
+            "The column to use as the PSM score. If not provided, crema will "
+            "try all available scores and select the best one automatically."
        ),
    )


+    logging.info("Reading PSMs...")
+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,
+        peptide_column=args.peptide_column,
+        protein_column=args.protein_column,
+        protein_delim=args.protein_delim,
+    )


+    out_files = []
+    for level, qval_list in results.items():
+        out_file = str(file_base) + f".{level}.parquet"
+        pd.concat(qval_list).to_parquet(out_file, index=False)
+        out_files.append(out_file)


Replace the explicit help-flag allowlist with a startswith('-') check, so all option flags (not just -h/--help/--version) leave the top-level parser untouched. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

    parser = argparse.ArgumentParser(
        description=desc, formatter_class=CremaHelpFormatter
    )


+    logging.info("Reading PSMs...")
+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,
+        peptide_column=args.peptide_column,
+        protein_column=args.protein_column,
+        protein_delim=args.protein_delim,
+    )


Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

+    ac.add_argument(
        "-p",
        "--pep_fdr_type",
        type=str,
        default="psm-only",
        choices=["psm-only", "peptide-only", "psm-peptide"],
-        help="The peptide-level FDR estimation method to use."
-        "Default is 'psm-peptide'",
+        help="The peptide-level FDR estimation method to use. "
+        "Default is 'psm-only'",
    )


+    logging.info("Reading PSMs...")
+    psms = read_txt(
+        args.psm_files,
+        target_column=args.target_column,
+        spectrum_columns=args.spectrum_columns,
+        score_columns=args.score_columns,
+        peptide_column=args.peptide_column,
+        protein_column=args.protein_column,
+        protein_delim=args.protein_delim,
+    )


wsnoble requested a review from Copilot June 20, 2026 21:53

Copilot started reviewing on behalf of wsnoble June 20, 2026 21:54 View session

Copilot AI reviewed Jun 20, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

Comment thread crema/crema.py

Comment thread crema/crema.py

Comment thread crema/crema.py

Comment thread crema/params.py

Comment thread crema/parsers/parquet.py

wsnoble requested a review from Copilot June 21, 2026 15:01

Copilot started reviewing on behalf of wsnoble June 21, 2026 15:01 View session