Reproducibility repository for the seqproc preprint. Every number, table cell,
and figure in the paper is produced by a script in this repository, applied to
the publicly available SRA datasets listed below. This branch (paper-submission)
contains only the scripts, configs, and tests needed to reproduce the paper.
Wider development history (exploratory scripts, abandoned chemistries,
intermediate findings docs) lives on the main development branch.
The paper benchmarks seqproc against matchbox and splitcode on four single-cell RNA-seq chemistries, measuring runtime, memory, and a per-chemistry recovery metric (see Recovery metrics below).
| Paper artifact | Produced by (in this repo) | Key output files |
|---|---|---|
| Table 3 (cross-chemistry benchmark — Total / Emitted / Validated / Recovery / Runtime / Memory) | run_paper_benchmarks.py + concordance_analysis.py + splitcode_lr_dual_validate.py + validate_lr_recall_against_vtotal.py + validate_pe_recall_against_vtotal.py |
results/paper_figures/benchmark_results.json, results/concordance/<chemistry>/results.json, results/lr_recall_vtotal/lr_recall_vtotal_results.json, results/pe_recall_vtotal/pe_recall_vtotal_results.json, results/splitcode_lr_dual/splitcode_lr_dual_results.json |
| Figure 3 (Recovery by Dataset and Tool — grouped bar chart) | generate_figures.py consuming the benchmark and V_total JSONs |
results/paper_figures/fig_recovery_comparison.{pdf,png} |
| Figure 4 (Hamming vs Edit Distance — relative recovery gain) | concordance_analysis.py (hamming_vs_edit subset) + generate_figures.py |
results/paper_figures/fig_hamming_vs_edit.{pdf,png} |
| Supplementary Figure S1 (Discordant read recovery breakdown) | discordant_analysis.py + generate_figures.py |
results/paper_figures/fig_discordant_summary.{pdf,png}, results/paper_figures/discordant_analysis.json |
| Supplementary Note S2 (V_total methodology) | validate_lr_recall_against_vtotal.py (LR) + validate_pe_recall_against_vtotal.py (PE) |
JSON outputs listed in Table 3 row above |
| Appendix B Table (LR-SPLiT-seq optimization progression on 1M-read subset) | lr_perf_rerun.py |
results/lr_perf/lr_perf_rerun_results.json |
The corresponding LaTeX in the paper repo cites the figure and JSON paths listed above; regenerating the JSONs and figures from this repo regenerates those exact files.
The paper reports a single column called Recovery (%) but the metric behind it differs across chemistries because the tools' LR-SPLiT-seq output formats diverge in ways that prevent a single validator from operating on all three. The methodology in detail:
| Chemistry | Recovery metric | Numerator | Denominator |
|---|---|---|---|
| SPLiT-seq PE | Recall vs V_total_PE | |tool emit IDs ∩ V_total_PE| |
|V_total_PE| (validity script on raw R2) |
| LR-SPLiT-seq | Recall vs V_total_LR | |tool emit IDs ∩ V_total_LR| |
|V_total_LR| = 489,305 (validity script on raw input) |
| 10x Chromium v2 | Valid Recovery (all 100% — trivial chemistry) | Validated reads in tool output | Total input reads |
| sci-RNA-seq3 | Valid Recovery | Validated reads in tool output | Total input reads |
V_total is the read-ID set the independent structural-validity script
accepts when run on the raw input FASTQ — i.e., the strict ground-truth set
of recoverable reads, defined entirely by the data and the validator's
criteria (exact str.find linker scan + Hamming ≤ 1 whitelist match).
A tool's Recovery is its recall against that set.
For more on why LR-SPLiT-seq requires the V_total framing (the three tools emit incompatible output formats — full FASTQ, compact barcode-only FASTQ, and TSV), see Supplementary Note S2 in the paper.
All scripts live under scripts/. Run them from the repo root with the
Python environment from requirements.txt.
Runs all three tools on each of the four chemistries, three replicates each, collecting wall-clock runtime, peak memory, emitted-read counts, and structurally validated read counts. Produces the bulk of Table 3.
python3 scripts/run_paper_benchmarks.py --threads 32 --replicates 3Inputs: tool binaries (set via SEQPROC_BIN, MATCHBOX_BIN,
SPLITCODE_BIN), raw FASTQs in $SEQPROC_DATA_DIR, configs in configs/.
Output: results/paper_figures/benchmark_results.json with one entry per
dataset, each containing total_reads, per-tool reads_out, recovery_rate,
runtime, peak_mem_mb, plus a sidecar .benchmark_checkpoint.json that
retains per-replicate reads_valid counts.
Paper artifacts: Table 3 (Total / Emitted / Validated / Runtime / Memory columns for short-read chemistries).
For each chemistry, runs all three tools and computes pairwise Jaccard indices between their emitted-read ID sets. Also re-runs seqproc with Hamming-only configs and compares against the edit-distance baseline.
python3 scripts/concordance_analysis.py --threads 32 --datasets all
# or restrict:
python3 scripts/concordance_analysis.py --datasets lr_splitseq splitseq_peOutput: results/concordance/<chemistry>/results.json per chemistry plus
results/concordance/<chemistry>/{seqproc_edit,matchbox,splitcode}_ids.txt
holding the actual emitted-read ID sets. The latter are the inputs to the
V_total intersection scripts below.
Paper artifacts: the hamming_vs_edit subsection of each chemistry's JSON
feeds Figure 4 (Hamming vs Edit Distance). The ID-set caches are used by the
discordant-analysis and V_total scripts.
Identifies reads recovered by only one of the three tools (the "discordant" set), then runs an independent linker + whitelist validator on those records to determine how many represent genuine signal vs. false-positive emit. On SPLiT-seq PE this is where the 99.2% structurally-invalid figure for splitcode's unique reads comes from.
python3 scripts/discordant_analysis.pyOutput: results/paper_figures/discordant_analysis.json with per-tool,
per-chemistry breakdown of unique-emit validity rates.
Paper artifacts: Supplementary Figure S1, and the prose at
results.tex:53 ("99.2% lacked valid linker sequences").
Splitcode does not natively support orientation-aware matching, which on a randomly-oriented chemistry like LR-SPLiT-seq leaves a fair-comparison gap. This script runs splitcode twice (forward and reverse-complement), unions the emitted-read sets, computes per-pass and combined validity, and also runs the strict raw-input validator to produce V_total_LR (the denominator for the LR-SPLiT-seq Recovery cells of Table 3).
python3 scripts/splitcode_lr_dual_validate.py \
--dataset full --threads 8 \
--outdir results/splitcode_lr_dualOutput: results/splitcode_lr_dual/splitcode_lr_dual_results.json with
the dual-pass aggregate counts, summed runtime, per-pass validity precision,
and V_total_LR (489,305 reads on this dataset). Also writes the deduplicated
combined FASTQ to the outdir for downstream intersection analysis.
Paper artifacts: the splitcode dual-pass row of Table 3 (LR-SPLiT-seq); the V_total_LR denominator referenced throughout the LR row.
Takes the V_total_LR ID set from the strict raw-input validator and intersects
it with each tool's emitted-read ID set (loaded from
results/concordance/lr_splitseq/*_ids.txt and from the dual-pass FASTQ).
Outputs exact |tool_emit ∩ V_total_LR| counts per tool — these are the
Validated column values in the LR-SPLiT-seq rows of Table 3.
python3 scripts/validate_lr_recall_against_vtotal.py \
--dual-pass-fq results/splitcode_lr_dual/splitcode_dual_combined_out.fq \
--outdir results/lr_recall_vtotalOutput: results/lr_recall_vtotal/lr_recall_vtotal_results.json with the
four |tool ∩ V_total| counts and recall percentages (98.14%, 87.56%, 53.85%,
100.00% in the canonical run).
Paper artifacts: Table 3 LR-SPLiT-seq Validated cells (480,192 / 428,424 / 263,499 / 489,305).
The SPLiT-seq PE analogue of the LR script above. Computes V_total_PE by
running the paired-end validator on raw R1+R2, then intersects against each
tool's emit IDs in results/concordance/splitseq_pe/.
python3 scripts/validate_pe_recall_against_vtotal.py \
--outdir results/pe_recall_vtotalOutput: results/pe_recall_vtotal/pe_recall_vtotal_results.json.
Paper artifacts: Table 3 SPLiT-seq PE Validated cells under the V_total framing.
Reproduces Appendix B's table showing the per-feature progression of seqproc on LR-SPLiT-seq (forward-only Hamming → forward-only edit → orientation-aware Hamming → orientation-aware edit). Used on a 1M-read subset for fast iteration.
python3 scripts/lr_perf_rerun.py --threads 4 --reps 3Output: results/lr_perf/lr_perf_rerun_results.json.
Paper artifacts: Table 4 in paper/sections/appendix_optimization.tex
(LR-SPLiT-seq optimization progression).
Reads the JSON outputs of the scripts above and produces the publication figures.
python3 scripts/generate_figures.pyOutput: under results/paper_figures/:
fig_recovery_comparison.{pdf,png}(Figure 3)fig_hamming_vs_edit.{pdf,png}(Figure 4)fig_discordant_summary.{pdf,png}(Supplementary Figure S1)
Single source of truth for: SRA accessions, FASTQ file paths, per-chemistry
seqproc geometry files, matchbox configs, splitcode configs, and barcode
map paths. All driver scripts import from here. Run as python3 scripts/data_config.py
to print the resolved paths for the current environment.
End-to-end setup script: installs Python + micromamba env, clones and builds tool binaries (seqproc, matchbox, splitcode), downloads the four SRA datasets, sets the env vars, runs the full benchmark pipeline. Designed for a fresh Linux machine with internet + gcc. Re-running is idempotent.
chmod +x scripts/setup_and_run.sh
./scripts/setup_and_run.shSanity-checks that every binary, config, FASTQ, and Python dependency required by the driver scripts is present and resolvable. Exits non-zero if anything is missing. Run this before any benchmark run on a new machine.
./scripts/dry_run_test.shRuns the benchmark, concordance, discordant, and figure-generation scripts in sequence with consistent threading and replicate count.
./scripts/run_all.sh --reads full --threads 32If the environment is already configured (binaries on path, FASTQs in
$SEQPROC_DATA_DIR, Python env active), the full pipeline is:
# 1. Validate environment
./scripts/dry_run_test.sh
# 2. Main benchmark (Table 3 short-read rows + runtime/memory for all)
python3 scripts/run_paper_benchmarks.py --threads 32 --replicates 3
# 3. Pairwise concordance + Hamming-vs-edit data (Figure 4)
python3 scripts/concordance_analysis.py --threads 32 --datasets all
# 4. Discordant-read structural validation (Supp Fig S1)
python3 scripts/discordant_analysis.py
# 5. LR-SPLiT-seq splitcode dual-pass and raw V_total_LR
python3 scripts/splitcode_lr_dual_validate.py --dataset full --threads 8 \
--outdir results/splitcode_lr_dual
# 6. LR-SPLiT-seq V_total intersection (Table 3 LR Validated cells)
python3 scripts/validate_lr_recall_against_vtotal.py \
--dual-pass-fq results/splitcode_lr_dual/splitcode_dual_combined_out.fq \
--outdir results/lr_recall_vtotal
# 7. SPLiT-seq PE V_total intersection (Table 3 PE Validated cells)
python3 scripts/validate_pe_recall_against_vtotal.py \
--outdir results/pe_recall_vtotal
# 8. Optimization progression (Appendix B)
python3 scripts/lr_perf_rerun.py --threads 4 --reps 3
# 9. Render figures from collected JSONs
python3 scripts/generate_figures.pyFor a clean machine without any of the above, use ./scripts/setup_and_run.sh
which performs steps 1 through 9 plus all the environment provisioning.
| Chemistry | SRA accession | Total reads (full) | Mode |
|---|---|---|---|
| SPLiT-seq PE | SRR6750041 |
86,820,578 | paired-end |
| LR-SPLiT-seq | SRR13948564 |
5,764,421 | single-end (PacBio HiFi) |
| 10x Chromium v2 | SRR8315379 |
56,514,800 | paired-end |
| sci-RNA-seq3 | SRR7827254 |
22,088,821 | paired-end |
Read counts are authoritative as of the canonical run; they are also
encoded in scripts/data_config.py:SRA_INFO and verified against
count_fastq_reads() at runtime.
FASTQ files are gitignored. Download with fasterq-dump <accession> into
$SEQPROC_DATA_DIR (default data/) before running anything.
configs/ contains the geometry / config files actually used by the paper:
configs/
seqproc/
splitseq_filter_edit.geom # SPLiT-seq PE, edit distance + whitelist
splitseq_filter_hamming6.geom # SPLiT-seq PE, Hamming baseline (Fig 4)
splitseq_replacement_edit.geom # SPLiT-seq PE, replacement-mode variant
splitseq_singleend_edit_ann.geom # LR-SPLiT-seq, orientation + edit (canonical)
splitseq_singleend_ann.geom # LR-SPLiT-seq, orientation + Hamming
splitseq_singleend_edit.geom # LR-SPLiT-seq, forward-only + edit
splitseq_singleend.geom # LR-SPLiT-seq, forward-only + Hamming
splitseq_singleend_primer_edit.geom # LR-SPLiT-seq, primer-based variant
10x_v2.geom # 10x Chromium v2
sciseq3.geom # sci-RNA-seq3, Hamming
sciseq3_edit.geom # sci-RNA-seq3, edit distance
splitseq_bc{1,2,3}_seq2seq.tsv # SPLiT-seq barcode maps (seq → seq)
splitseq_bc{1_whitelist_6bp,23_whitelist}.txt # whitelists
matchbox/
splitseq_replacement.mb # SPLiT-seq PE
splitseq_singleend.mb # LR-SPLiT-seq, forward-only
splitseq_singleend_dual.mb # LR-SPLiT-seq, dual-orientation
10x_v2.mb # 10x Chromium v2
sciseq3.mb # sci-RNA-seq3
r2_r3.txt # matchbox support file for SPLiT-seq
splitcode/
splitseq_paper.config # SPLiT-seq PE
splitseq_singleend.config # LR-SPLiT-seq
10x_v2.config # 10x Chromium v2
sciseq3.config # sci-RNA-seq3
The previous 10x_longread_* configs (from an aborted long-read 10x
experiment that did not make the paper) have been removed in this branch.
A regression test suite guards the benchmark pipeline against silently-incorrect behavior:
python3 -m pytest tests/ -vCoverage:
- Script syntax (every Python script compiles)
- Module imports (each script importable, with expected constants)
- Path resolution against
data_config.py - DATASETS dict consistency (every referenced file exists)
- Bug-fix regressions (subprocess exit code parsing, barcode-distance scoping, paired-end ID extraction)
- DNA reverse-complement primitive correctness
- Jaccard index correctness
- Recovery rates within expected bounds (when cluster artifacts are present)
- Banned-term guard (catches accidental development-iteration jargon leaking into the published code)
- Structural-validity analyzer correctness on synthetic and real reads
(
test_se_validity.py,test_se_validity_real_data.py)
A pre-commit hook in .git/hooks/pre-commit runs the suite automatically
before every commit; commits failing the suite are rejected.
scripts/
run_paper_benchmarks.py # main benchmark runner (Table 3)
concordance_analysis.py # pairwise concordance + hamming-vs-edit (Fig 4)
discordant_analysis.py # unique-read structural validation (Supp Fig S1)
generate_figures.py # render publication figures
splitcode_lr_dual_validate.py # splitcode dual-pass + V_total_LR (Table 3 LR rows)
validate_lr_recall_against_vtotal.py # exact LR Validated cells (Table 3)
validate_pe_recall_against_vtotal.py # exact PE Validated cells (Table 3)
lr_perf_rerun.py # LR optimization progression (Appendix B)
data_config.py # shared paths / datasets / configs module
setup_and_run.sh # clean-machine bootstrap (binaries + data + run)
dry_run_test.sh # environment validation
run_all.sh # benchmark + concordance + discordant + figures
configs/
seqproc/ # EFGDL geometries (.geom) + whitelists
matchbox/ # matchbox pattern files (.mb)
splitcode/ # splitcode configs (.config)
tests/
test_benchmark_pipeline.py # bug-fix unit tests
test_regression.py # whole-pipeline regression suite
test_se_validity.py # SplitSeqSingleEndValidityAnalyzer unit tests
test_se_validity_real_data.py # validator behavior on actual FASTQ samples
conftest.py # shared pytest fixtures
data/ # (gitignored) raw FASTQs from SRA
results/ # (symlink) all JSON outputs and figures
README.md # this file
requirements.txt # Python dependencies (matplotlib, numpy, etc.)
pytest.ini # pytest configuration
.gitignore
- Python 3.6+ (3.11 recommended; cluster runs use micromamba env
bench) - numpy, matplotlib (
pip install -r requirements.txt) - pytest for the test suite
- Tool binaries: seqproc, matchbox, splitcode — pinned versions installed by
setup_and_run.shfrom their respective repositories - For SRA downloads:
sra-tools(fasterq-dump), installed bysetup_and_run.sh
Set the following before running scripts manually (setup_and_run.sh exports
them automatically):
export SEQPROC_PROJECT_ROOT=$(pwd) # this repo
export SEQPROC_DATA_DIR=$WORKDIR/data # FASTQ location
export SEQPROC_BIN=<path>/seqproc # tool binaries
export MATCHBOX_BIN=<path>/matchbox
export SPLITCODE_BIN=<path>/splitcodeIf you use this analysis, please cite the seqproc paper.