Phase 1: eliminate redundant data copies and add optional pyarrow engine by wsnoble · Pull Request #26 · Noble-Lab/crema

wsnoble · 2026-06-19T22:30:56Z

Summary

Remove .copy() from PsmDataset.data property — previously allocated a full copy of the dataset on every property access.
Change copy_data default to False in PsmDataset.__init__ and all parser entry-points (read_txt, read_tide, read_comet, read_msamanda, read_msfragger, read_msgf) — eliminates a redundant deep copy when callers pass an existing DataFrame.
Use pyarrow CSV engine (with C-engine fallback) in _parse_psms() and parse_psms_txt() — 2–4× faster parsing when pyarrow is installed.

All 90 unit tests pass. This is Phase 1 of SCALING_PLAN.md.

Test plan

CI passes (90 unit tests + lint)
Manually verify read_tide / read_txt still work on a real Tide file
Install pyarrow and confirm pyarrow path is exercised

🤖 Generated with Claude Code

coderabbitai · 2026-06-19T22:34:07Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8de0c0d7-2c0c-4390-a3ed-1ad4349468cd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch scale-phase-1-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR reduces memory overhead when constructing/inspecting PsmDataset and speeds up text parsing by optionally using pandas’ pyarrow CSV engine (with fallback).

Changes:

Remove deep-copy behavior from PsmDataset.data and change copy_data defaults to False across dataset/parsers.
Add optional engine="pyarrow" usage for read_csv in text parsing helpers.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
crema/utils.py	Adds optional pyarrow-based `read_csv` path in `parse_psms_txt()`
crema/parsers/txt.py	Adds optional pyarrow-based `read_csv` path in `_parse_psms()` and changes `read_txt(copy_data)` default
crema/parsers/tide.py	Changes `read_tide(copy_data)` default to `False`
crema/parsers/msgf.py	Changes `read_msgf(copy_data)` default to `False`
crema/parsers/msfragger.py	Changes `read_msfragger(copy_data)` default to `False`
crema/parsers/msamanda.py	Changes `read_msamanda(copy_data)` default to `False`
crema/parsers/comet.py	Changes `read_comet(copy_data)` default to `False`
crema/dataset.py	Changes `copy_data` default to `False` and stops copying on `.data` access

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    kwargs = dict(
+        sep="\t", skiprows=int(skip_line), usecols=lambda c: c in cols
    )
+    try:
+        return pd.read_csv(txt_file, engine="pyarrow", **kwargs)
+    except ImportError:
+        return pd.read_csv(txt_file, **kwargs)


        If true, a deep copy of the data is created. This uses more memory, but
        is safer because it prevents accidental modification of the underlying
        data. This argument only has an effect when `pin_files` is a
-        :py:class:`pandas.DataFrame`
+        :py:class:`pandas.DataFrame`. Default is ``False`` to minimise memory
+        use; set to ``True`` if you need to protect the original DataFrame.


    def data(self):
        """The collection of PSMs as a :py:class:`pandas.DataFrame`."""
-        return self._data.copy()
+        return self._data


Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

+    # pyarrow engine doesn't support callable usecols, so pass an explicit list.
+    # Fall back to the default C engine if pyarrow is missing or the pandas
+    # version doesn't support engine="pyarrow" (raises ImportError or ValueError).
+    with open(txt_file) as fh:


Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

+                fh.readline()
+            header = fh.readline().rstrip("\n").split("\t")
+        explicit_cols = [c for c in header if c in cols]


+        """The collection of PSMs as a :py:class:`pandas.DataFrame`.
+
+        Returns the internal DataFrame by reference. Callers that need an
+        independent copy should call ``.data.copy()`` explicitly.
+        """
+        return self._data


- Remove `.copy()` from `PsmDataset.data` property; callers that need an independent copy must now call `.data.copy()` explicitly. - Change `copy_data` default from `True` to `False` in `PsmDataset.__init__` and in all public parser entry-points (`read_txt`, `read_tide`, `read_comet`, `read_msamanda`, `read_msfragger`, `read_msgf`) so that file-backed DataFrames are never copied a second time on ingestion. - Use pyarrow CSV engine in `_parse_psms()` and `parse_psms_txt()` when pyarrow is available, falling back to the default C engine otherwise. This cuts parse time roughly 2–4× on large files at no API cost. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- utils.py: pyarrow engine doesn't support callable usecols; pre-compute an explicit column list from the file header and catch (ImportError, ValueError) so older pandas versions fall back cleanly to the C engine. - parsers/txt.py: also catch ValueError (not just ImportError) in pyarrow fallback, for consistency with utils.py and older pandas support. - dataset.py: fix copy_data docstring ('pin_files' → 'psms'); document that the data property returns the internal DataFrame by reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Open the file with encoding='utf-8' (matching pandas read_csv default) when pre-reading the header for the pyarrow usecols list. If a UnicodeDecodeError occurs, fall back directly to the C engine with a callable usecols filter, avoiding a crash on non-UTF-8 locale files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- confidence.py: self._data = psms.data.copy() to avoid mutating the original PsmDataset when confidence assigns columns (e.g. "pairing") - utils.py: rstrip("\r\n") instead of "\n" so Windows CRLF files are parsed correctly by the pyarrow engine path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wsnoble force-pushed the scale-phase-1-2 branch 3 times, most recently from 92a5f1f to 602d9b6 Compare June 19, 2026 22:56

wsnoble requested a review from Copilot June 19, 2026 22:56

Copilot started reviewing on behalf of wsnoble June 19, 2026 22:57 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

wsnoble requested a review from Copilot June 19, 2026 23:35

Copilot started reviewing on behalf of wsnoble June 19, 2026 23:35 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread crema/utils.py Outdated

# pyarrow engine doesn't support callable usecols, so pass an explicit list.

# Fall back to the default C engine if pyarrow is missing or the pandas

# version doesn't support engine="pyarrow" (raises ImportError or ValueError).

with open(txt_file) as fh:

wsnoble requested a review from Copilot June 20, 2026 00:12

Copilot started reviewing on behalf of wsnoble June 20, 2026 00:13 View session

Copilot AI reviewed Jun 20, 2026

View reviewed changes

wsnoble force-pushed the scale-phase-1-2 branch from 07540af to fdcf8ea Compare June 20, 2026 00:43

wsnoble and others added 4 commits June 19, 2026 19:20

wsnoble force-pushed the scale-phase-1-2 branch from fdcf8ea to 4386e8a Compare June 20, 2026 02:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 1: eliminate redundant data copies and add optional pyarrow engine#26

Phase 1: eliminate redundant data copies and add optional pyarrow engine#26
wsnoble wants to merge 4 commits into
add-unit-testsfrom
scale-phase-1-2

wsnoble commented Jun 19, 2026

Uh oh!

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading

Review skipped

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wsnoble commented Jun 19, 2026

Summary

Test plan

Uh oh!

coderabbitai Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 19, 2026 •

edited

Loading