Skip to content

Phase 1: eliminate redundant data copies and add optional pyarrow engine#26

Open
wsnoble wants to merge 4 commits into
add-unit-testsfrom
scale-phase-1-2
Open

Phase 1: eliminate redundant data copies and add optional pyarrow engine#26
wsnoble wants to merge 4 commits into
add-unit-testsfrom
scale-phase-1-2

Conversation

@wsnoble

@wsnoble wsnoble commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Remove .copy() from PsmDataset.data property — previously allocated a full copy of the dataset on every property access.
  • Change copy_data default to False in PsmDataset.__init__ and all parser entry-points (read_txt, read_tide, read_comet, read_msamanda, read_msfragger, read_msgf) — eliminates a redundant deep copy when callers pass an existing DataFrame.
  • Use pyarrow CSV engine (with C-engine fallback) in _parse_psms() and parse_psms_txt() — 2–4× faster parsing when pyarrow is installed.

All 90 unit tests pass. This is Phase 1 of SCALING_PLAN.md.

Test plan

  • CI passes (90 unit tests + lint)
  • Manually verify read_tide / read_txt still work on a real Tide file
  • Install pyarrow and confirm pyarrow path is exercised

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8de0c0d7-2c0c-4390-a3ed-1ad4349468cd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch scale-phase-1-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@wsnoble wsnoble force-pushed the scale-phase-1-2 branch 3 times, most recently from 92a5f1f to 602d9b6 Compare June 19, 2026 22:56
@wsnoble wsnoble requested a review from Copilot June 19, 2026 22:56

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces memory overhead when constructing/inspecting PsmDataset and speeds up text parsing by optionally using pandas’ pyarrow CSV engine (with fallback).

Changes:

  • Remove deep-copy behavior from PsmDataset.data and change copy_data defaults to False across dataset/parsers.
  • Add optional engine="pyarrow" usage for read_csv in text parsing helpers.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
crema/utils.py Adds optional pyarrow-based read_csv path in parse_psms_txt()
crema/parsers/txt.py Adds optional pyarrow-based read_csv path in _parse_psms() and changes read_txt(copy_data) default
crema/parsers/tide.py Changes read_tide(copy_data) default to False
crema/parsers/msgf.py Changes read_msgf(copy_data) default to False
crema/parsers/msfragger.py Changes read_msfragger(copy_data) default to False
crema/parsers/msamanda.py Changes read_msamanda(copy_data) default to False
crema/parsers/comet.py Changes read_comet(copy_data) default to False
crema/dataset.py Changes copy_data default to False and stops copying on .data access

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crema/utils.py Outdated
Comment on lines +88 to +94
kwargs = dict(
sep="\t", skiprows=int(skip_line), usecols=lambda c: c in cols
)
try:
return pd.read_csv(txt_file, engine="pyarrow", **kwargs)
except ImportError:
return pd.read_csv(txt_file, **kwargs)
Comment thread crema/parsers/txt.py
Comment thread crema/dataset.py
Comment on lines 46 to +50
If true, a deep copy of the data is created. This uses more memory, but
is safer because it prevents accidental modification of the underlying
data. This argument only has an effect when `pin_files` is a
:py:class:`pandas.DataFrame`
:py:class:`pandas.DataFrame`. Default is ``False`` to minimise memory
use; set to ``True`` if you need to protect the original DataFrame.
Comment thread crema/dataset.py
Comment on lines 118 to +120
def data(self):
"""The collection of PSMs as a :py:class:`pandas.DataFrame`."""
return self._data.copy()
return self._data

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread crema/utils.py Outdated
# pyarrow engine doesn't support callable usecols, so pass an explicit list.
# Fall back to the default C engine if pyarrow is missing or the pandas
# version doesn't support engine="pyarrow" (raises ImportError or ValueError).
with open(txt_file) as fh:

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread crema/utils.py
Comment on lines +101 to +103
fh.readline()
header = fh.readline().rstrip("\n").split("\t")
explicit_cols = [c for c in header if c in cols]
Comment thread crema/dataset.py
Comment on lines +119 to +124
"""The collection of PSMs as a :py:class:`pandas.DataFrame`.

Returns the internal DataFrame by reference. Callers that need an
independent copy should call ``.data.copy()`` explicitly.
"""
return self._data
wsnoble and others added 4 commits June 19, 2026 19:20
- Remove `.copy()` from `PsmDataset.data` property; callers that need an
  independent copy must now call `.data.copy()` explicitly.
- Change `copy_data` default from `True` to `False` in `PsmDataset.__init__`
  and in all public parser entry-points (`read_txt`, `read_tide`, `read_comet`,
  `read_msamanda`, `read_msfragger`, `read_msgf`) so that file-backed
  DataFrames are never copied a second time on ingestion.
- Use pyarrow CSV engine in `_parse_psms()` and `parse_psms_txt()` when
  pyarrow is available, falling back to the default C engine otherwise.
  This cuts parse time roughly 2–4× on large files at no API cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- utils.py: pyarrow engine doesn't support callable usecols; pre-compute an
  explicit column list from the file header and catch (ImportError, ValueError)
  so older pandas versions fall back cleanly to the C engine.
- parsers/txt.py: also catch ValueError (not just ImportError) in pyarrow
  fallback, for consistency with utils.py and older pandas support.
- dataset.py: fix copy_data docstring ('pin_files' → 'psms'); document that
  the data property returns the internal DataFrame by reference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Open the file with encoding='utf-8' (matching pandas read_csv default)
when pre-reading the header for the pyarrow usecols list. If a
UnicodeDecodeError occurs, fall back directly to the C engine with a
callable usecols filter, avoiding a crash on non-UTF-8 locale files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- confidence.py: self._data = psms.data.copy() to avoid mutating the
  original PsmDataset when confidence assigns columns (e.g. "pairing")
- utils.py: rstrip("\r\n") instead of "\n" so Windows CRLF files are
  parsed correctly by the pyarrow engine path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants