Refactor: from ad-hoc benchmark scripts to a Meteor Benchmark Platform by italojs · Pull Request #22 · meteor/performance

italojs · 2026-06-30T14:00:37Z

Refactor: from ad-hoc benchmark scripts to a Meteor Benchmark Platform

Branch: redesign/v2-blaze-tailwind → main
74 commits · 229 files changed · +26,628 / −38,033

live demo: https://meteor-benchmarks.us.galaxycloud.app/

Summary

This PR rewrites the repository from a loose collection of benchmark scripts, Playwright tests and hand-captured .log files into a structured, reproducible benchmark platform for Meteor.

What used to be "run a script, eyeball a log, commit the output" is now a single CLI (bench.js) with a modular pipeline — drivers → collectors → aggregators → reporters — backed by a Meteor instrumentation package, a results dashboard, a unit-test suite, and CI workflows that run benchmarks on PRs, nightly, and across transport/observer matrices.

The net line count drops by about 11.4k (+26,628 / −38,033) because ~13k lines of stale benchmark logs and the entire Meteor 2.x app were removed, even as the new harness, monitors, dashboard and tests were added.

Why

The old setup (main) had real limitations:

Manual & non-reproducible — benchmarks were run by hand and results pasted into benchmarks/**/*.log. No schema, no comparison tooling, no regression gate.
Stale artifacts — thousands of lines of committed run logs that nobody could re-derive.
Two apps to maintain — Meteor 2.x and 3.x, plus abandoned OTel / APM-agent experiments.
No CI signal — nothing ran benchmarks automatically or flagged regressions.

The goal of this branch was to make benchmarking a first-class, automatable workflow: one command to run a scenario, structured JSON out, automatic regression detection against a baseline, and a dashboard to visualize trends across Meteor versions and transport/observer configurations.

What changed

1. New CLI & harness architecture

A thin bench.js entry point dispatches to focused modules:

bench.js              # parse argv → dispatch → exit
├── cli/              # subcommand handlers: run, list, compare, push, baseline, bundle-delta
├── drivers/          # how a scenario executes: artillery, script, cold-start, bundle-size, build-profile
├── collectors/       # live process/DB sampling: cpu/ram, event-loop, gc, mongo ops/pool/wiredtiger/...
├── runner/           # orchestration + per-metric aggregators
├── reporters/        # json-reporter + regression-detector (markdown/json output)
├── lib/              # shared pure helpers (percentiles, ...)
└── meteor-source.js  # resolve pinned release vs local checkout

Subcommands:

Command	Purpose
`node bench.js list`	List scenarios and apps
`node bench.js run --scenario X --app Y --tag Z`	Run a benchmark, write result JSON
`node bench.js compare --baseline A --target B`	Diff two results, detect regressions
`node bench.js push --result file.json`	Push a result to the dashboard
`node bench.js baseline --scenario X --run-id Y`	Pin a run as the scenario baseline
`node bench.js bundle-delta [--limit N]`	Bundle-size trend across saved runs

2. `bench-monitors` Meteor package (server-side instrumentation)

A new in-app package (apps/tasks-3.x/packages/bench-monitors/) injects lightweight, opt-in server instrumentation that emits parseable metrics consumed by the harness:

Method timing, subscription timing, live-update propagation latency
DDP message counter, DDP frame size, DDP compression
Observer-pool sampler, driver-fallback tracker
Dump-on-shutdown hook so metrics survive process exit

3. Metric collectors & aggregators (tasks 01–24)

A broad set of metrics, each with a collector (sampling) + aggregator (summarizing) + unit tests + dashboard panel:

Process: CPU/RAM, event-loop lag, GC pauses
Mongo: ops rates, slow queries, index usage, connection pool, WiredTiger cache, change streams
DDP: method/sub timing, message rate, frame size, compression
Meteor internals: observer pool, driver fallbacks
Build: METEOR_PROFILE=1 build profile (hot nodes) + per-plugin compile time, bundle-size delta

4. Results dashboard (`apps/dashboard/`) — design v2

A new Meteor app to visualize runs, built on a Tailwind design system (v2):

Design system — swapped Bootstrap → Tailwind, with Geist / JetBrains Mono typography and shared theme tokens.
Rebuilt pages — Runs overview, Detail (grouped metric sections + sticky section rail), Compare (regression scoreboard + side-by-side diff), Scenario view, and Trends.
Runs are pushed over DDP and rendered with per-metric panels for every metric above; deployed to Galaxy at meteor-benchmarks.us.galaxycloud.app.

5. Runtime observability & configuration matrix

The app logs [runtime-info] observer_driver=… / transport=… on startup; the harness captures these from stderr into each result's runtime field, so every pushed run is self-describing.
Supports benchmarking published releases (--meteor-version) or a local checkout (--meteor-checkout), mutually exclusive.
Enables explicit {changeStreams, oplog} × {sockjs, uws} comparison on the dashboard.
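A minimal sketch of how the `[runtime-info]` lines could be turned into a `runtime` object. The helper name `parseRuntimeInfo` and the assumption that pairs appear as whitespace-separated `key=value` tokens after the tag are illustrative; the harness's actual extraction code may differ.

```javascript
// Hypothetical helper: collect key=value pairs from "[runtime-info]" log lines.
// Assumes pairs are separated by whitespace; the real log format may differ.
function parseRuntimeInfo(stderrText) {
  const runtime = {};
  for (const line of stderrText.split('\n')) {
    const at = line.indexOf('[runtime-info]');
    if (at === -1) continue; // ignore lines without the tag
    const rest = line.slice(at + '[runtime-info]'.length);
    for (const match of rest.matchAll(/(\w+)=(\S+)/g)) {
      runtime[match[1]] = match[2];
    }
  }
  return runtime;
}

const sampleLog = [
  '[runtime-info] observer_driver=oplog / transport=uws',
  'unrelated line with a=b',
].join('\n');
console.log(parseRuntimeInfo(sampleLog));
```

Lines without the tag are skipped, so unrelated `key=value` noise in stderr does not leak into the result.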

6. CI workflows

benchmark-pr.yml — run benchmarks on PRs (with hardened client_payload handling)
benchmark-nightly.yml — scheduled runs
benchmark-runtime-matrix.yml — the 2×2 observer × transport matrix
benchmark-transport.yml — sockjs vs uws

7. Test suite

~40 node:test unit-test files covering every aggregator, the regression detector (incl. zero-baseline / NaN / Infinity edge cases), CLI commands, the meteor-source resolver, runtime-info extraction, and a metric-keys contract test to keep collector output and the dashboard in sync.
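To make the edge cases concrete, here is a hedged sketch of a per-metric comparison that handles zero baselines and non-finite values explicitly. `compareMetric`, its status strings, and the "higher is worse" assumption are hypothetical, not the regression detector's actual API.

```javascript
// Hypothetical sketch: classify one metric against a relative threshold (0.10 = 10%).
// Assumes "higher is worse" (latency, CPU); names and statuses are illustrative.
function compareMetric(baseline, target, threshold) {
  if (!Number.isFinite(baseline) || !Number.isFinite(target)) {
    // null, NaN and Infinity all land here instead of being skipped silently.
    return { status: 'invalid', changePct: null };
  }
  if (baseline === 0) {
    // A relative change is undefined against zero, so report it explicitly.
    return { status: target === 0 ? 'pass' : 'zero-baseline', changePct: null };
  }
  const change = (target - baseline) / baseline;
  const changePct = Math.round(change * 1000) / 10; // one decimal place
  if (change > threshold) return { status: 'regression', changePct };
  if (change < -threshold) return { status: 'improvement', changePct };
  return { status: 'pass', changePct };
}

console.log(compareMetric(100, 120, 0.1));
console.log(compareMetric(0, 5, 0.1));
console.log(compareMetric(100, NaN, 0.1));
```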

8. Cleanup / removals

Removed the Meteor 2.x app (apps/tasks-2.x) — focus is on Meteor 3.x.
Removed OTel and the APM-agent experiments.
Deleted ~13k lines of stale, hand-captured benchmark .log files under benchmarks/.
Collapsed top-level packages/ into the app; pruned obsolete files and tightened .gitignore.
Converted the harness to ESM and bumped to Node 24 (CI + Volta).

Migration notes

Node 24 required (was Node 20).
The harness is now ESM ("type": "module").
The Meteor 2.x app is gone — all scenarios target tasks-3.x.
Old benchmarks/**/*.log artifacts were intentionally removed; reproduce via node bench.js run instead.

How to test

npm install
npm test                                  # unit suite
node bench.js list                        # sanity-check config
node bench.js run --scenario ddp-reactive-light --app tasks-3.x --tag smoke

🤖 Generated with Claude Code

- bench.js CLI: run, compare, push, baseline, list commands
- Collectors: CPU/RAM (pidusage), GC (perf_hooks), event loop delay
- Regression detector with configurable thresholds and markdown output
- Blaze dashboard app (Meteor 3, Bootstrap 5, Chart.js)
- Pages: Dashboard, Compare, Trends, Run Detail
- DDP methods for pushing results from CLI
- Artillery light scenario (30 VUs) for quick CI runs
- GitHub Actions workflows: PR benchmark + nightly

Two new scenarios using SimpleDDP + ws (no Playwright/Chromium):
- ddp-reactive-light: subscribe + CRUD (150 VUs, 30s)
- ddp-non-reactive-light: methods-only CRUD (150 VUs, 30s)

Isolates server/DDP performance from browser rendering overhead.

Add curl + sleep before DDP push to wake Galaxy free tier from cold start. Add timeout-minutes: 2 on push steps to avoid 30min CI hangs.

Clicking a scenario name in dashboard or detail view opens a page with:
- Simple description (what does this scenario do?)
- At-a-glance table (driver, VUs, duration, browser required)
- Technical details (DDP flow, oplog, collectors)
- Recent runs for this scenario

cold-start: runs meteor reset + meteor run N times, reports median/min/max startup time
bundle-size: runs meteor build --directory, reports client JS, server, total bundle size + build time

New driver type 'script' for standalone Node.js benchmarks. fanout-bench.js: connects N subscribers, 1 writer does inserts, measures time for all subscribers to receive the reactive update. Reports p50/p95/p99/avg/max fanout latency + CPU/RAM/GC.
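The latency summary described above can be sketched as follows. The nearest-rank percentile method and the `summarize` helper name are assumptions; the harness's shared percentile helper may use a different interpolation.

```javascript
// Nearest-rank percentile over a sorted copy of the samples (input is not mutated).
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank computation in exact integers.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical summary shape mirroring the p50/p95/p99/avg/max fields named above.
function summarize(samples) {
  if (samples.length === 0) return null;
  const sum = samples.reduce((acc, v) => acc + v, 0);
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
    avg: sum / samples.length,
    max: Math.max(...samples),
  };
}

console.log(summarize([40, 10, 30, 20]));
```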

Usage: node bench.js run --scenario ddp-reactive-light --env DDP_TRANSPORT=uws
Supports multiple: --env DDP_TRANSPORT=uws --env MONGO_OPLOG_URL=...
Injected into all Meteor spawns (run, script, cold-start, bundle-size)

Runs the same scenario on the same branch with DDP_TRANSPORT=sockjs and DDP_TRANSPORT=uws in parallel. Pushes both results to dashboard for comparison.

Aligns the harness's declared runtime with the version we want to target across the refactor. Workflows previously pinned Node 22; volta and package.json engines declared 20.x. After this commit everything agrees on >=24. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Meteor 2.x is end-of-life and no CI workflow references the tasks-2.x benchmark app. Removing it shrinks the harness scope to the supported Meteor 3 surface — fewer apps to keep building, fewer env vars to document, one less map entry to wire scenarios against. The bench.config.js apps map, README structure tree, and RUNTIME deploy docs now reflect tasks-3.x as the sole benchmark target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Meteor auto-discovers <app>/packages/, so the top-level packages/ directory needed a METEOR_PACKAGE_DIRS injection at every meteor spawn to be picked up. Moving tasks-common under apps/tasks-3.x/packages/ removes that ceremony — the package now lives where Meteor expects it and bench.js drops four redundant env-var assignments. apm-agent only served two orphaned shell scripts (monitor.sh, deploy.sh) and was user-approved for deletion; SCRIPTS.md (commit 13) will flag the caveat for anyone who needs to re-add it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The otel/ directory held a Grafana Tempo + collector stack that no script, workflow, or harness module references — confirmed by grep across the repo. Removing it from the working tree (it was never git-tracked) drops the unused infra spec, and the RUNTIME.md Deploy section sheds its MontiAPM/ENABLE_APM narrative since the public harness path (bench.js) doesn't toggle APM. ENABLE_APM remains live in the orphan scripts/monitor.sh and scripts/deploy.sh; commit 13's SCRIPTS.md will document those as legacy ops helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The workflow accepted repository_dispatch payloads and interpolated client_payload.* values directly into `run:` shell blocks (and into a github-script body), which is the GitHub Actions command-injection pattern documented in https://github.blog/security/vulnerability-research/how-to-catch-github-actions-workflow-injections-before-attackers-do/ A caller able to dispatch the `benchmark-pr` event could inject shell commands via fields like client_payload.branch. Fix: every value that flows from event payload or inputs into a shell or script body now goes through the step's `env:` block first, and the body references the value via $VAR (shell) or process.env.VAR (github-script). Downstream steps that consumed steps.params.outputs.* in shell get the same treatment. Behavior is unchanged. Nightly and Transport workflows trigger only on schedule/workflow_dispatch (no external payload), so they are out of scope for this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Node 24 ships with first-class ESM and node:util.parseArgs, removing the last reasons to keep the harness on CJS + minimist. Adding "type": "module" lets every harness module use top-level static imports, dropping the dual require()/dynamic-require pattern in bench.js (simpleddp/ws were lazy-required just to delay the dependency cost). parseArgs gives us a typed schema for every flag the CLI accepts, including the repeatable --env KEY=VALUE (multiple: true). The CLI contract is byte-identical: same flag names, same exit codes, same help text. gc-monitor stays CJS as gc-monitor.cjs since Node's --require loader is CJS-only — the path reference in bench.js is updated and an explicit comment at the top of the file flags this so no one converts it on a future pass. The dead regression-detector require.main === module CLI shim is removed; bench.js compare is the sole public entrypoint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
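A small sketch of the repeatable `--env KEY=VALUE` flag using `parseArgs` from `node:util`. The option schema below is a hypothetical subset of the CLI, and this `splitEnvArgs` body is only a guess at what a helper of that name might do. CommonJS `require` is used so the snippet runs standalone, although the harness itself is ESM.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical subset of the CLI schema; only the repeatable --env shape
// (multiple: true) comes from the commit text.
const options = {
  scenario: { type: 'string' },
  env: { type: 'string', multiple: true },
};

// Split "KEY=VALUE" at the first "=" so values may themselves contain "=".
function splitEnvArgs(pairs = []) {
  const env = {};
  for (const pair of pairs) {
    const eq = pair.indexOf('=');
    if (eq <= 0) throw new Error(`--env expects KEY=VALUE, got "${pair}"`);
    env[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return env;
}

const { values } = parseArgs({
  args: [
    '--scenario', 'ddp-reactive-light',
    '--env', 'DDP_TRANSPORT=uws',
    '--env', 'MONGO_OPLOG_URL=mongodb://host/local?x=1',
  ],
  options,
});
console.log(values.scenario, splitEnvArgs(values.env));
```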

Wires npm test to node:test and covers the pure paths of the harness that exist today: splitEnvArgs, regression-detector (compare + toMarkdown), buildResult shape/keying/fallback, and writeResult + appendToHistory using os.tmpdir() for fs assertions. Six fixtures under tests/unit/fixtures/ — baseline plus targets for pass / regression / improvement / zero-baseline / non-finite scenarios — match the collector-produced JSON schema the dashboard reads. regression-detector tests for zero-baseline, null target, NaN, and Infinity pin the CURRENT (buggy) silent-skip behavior with explicit "TODO commit 11" comments, so the diff at commit 11 is unambiguous. The buildResult tests skip the meteorCheckoutPath git-shelling branch because mock.method against the node:child_process namespace fails with "Cannot redefine property: execSync" — that branch is being deleted in commit 7 (buildResult becomes pure, taking meteor: {version, sha} as input), and commit 7 will add coverage for the pure signature directly. 48 tests pass in ~90ms; well under the 5s npm test budget. No Meteor, no network. node bench.js list and node bench.js compare against the new fixtures both work; compare exits 0 on the passing-target fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Meteor introspection now has a single source of truth. The previous duplication — git-shelling once inside reporters/json-reporter.js buildResult and once inside bench.js getMeteorInfo — collapses into resolveMeteorSource({flags, env, config}) in meteor-source.js, called once per command and threaded through as a source object. buildResult becomes pure: it takes meteor: {version, sha} as input, no shells out. The two try/catches that silently returned 'unknown' on git failure are gone — getMeteorInfo now throws an actionable error naming the checkout path and asking "Is this a git checkout?" so misconfigured runs fail loud instead of producing JSON with sha='unknown' that hides the real problem. The function signature already accepts the meteor-version / METEOR_RELEASE / config.meteorVersion inputs commit 8 will wire in, so commit 8 only adds the release-mode branch. Git shelling uses execFileSync('git', [...]) instead of execSync(cmd) — no shell, no possibility of injection from the checkout path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
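The "no shell" property of `execFileSync` can be demonstrated without git. In this sketch the child process is Node itself, echoing back its first argument; the hostile-looking path is made up for illustration.

```javascript
const { execFileSync } = require('node:child_process');

// A value that would be dangerous if interpolated into a shell command string.
const hostilePath = '/tmp/checkout; echo INJECTED';

// execFileSync passes each argv entry to the child verbatim; no shell parses it.
// The child here is Node itself, writing back the first extra argument it received.
const echoed = execFileSync(
  process.execPath,
  ['-e', 'process.stdout.write(process.argv[1])', hostilePath],
  { encoding: 'utf8' },
);

console.log(echoed === hostilePath);
```

The same reasoning would apply to a call of the form `execFileSync('git', [...args])`; the exact git arguments the harness uses are not shown in the commit message.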

Lets users benchmark a published Meteor release without checking out the source — useful for reproducible CI runs and comparing released versions. resolveMeteorSource extends commit 7's checkout/system branches with a release branch that pre-bakes meteor.version=<version> and meteor.sha='release:<version>'; the 'release:' prefix keeps the existing dashboard JSON contract (non-empty strings) while making the mode visible at a glance. Inputs come from flag > env > config: --meteor-version, METEOR_RELEASE, config.meteorVersion. The mode is mutually exclusive with checkout — both set with usable values throws an error naming the conflicting strings ("got version=X and checkout=Y. Pick one."). The exclusion check requires the checkout binary to actually exist, so a stale config.meteorCheckoutPath default doesn't block --meteor-version. Every meteor spawn site picks up source.releaseArg via a small meteorArgv/meteorShellPrefix helper — 9 sites, 0 duplication. README's new "Meteor source" section documents both modes side-by-side. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
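The precedence and exclusion rules can be sketched as below. This is not the real `resolveMeteorSource`: the returned shape, the `mode` field, and the `checkoutUsable` stub (standing in for the "checkout binary actually exists" check) are assumptions made for illustration.

```javascript
// Hypothetical sketch of flag > env > config precedence plus mutual exclusion.
// `checkoutUsable` stands in for the real check that the checkout binary exists.
function resolveMeteorSource({ flags = {}, env = {}, config = {}, checkoutUsable = () => true } = {}) {
  const version = flags['meteor-version'] ?? env.METEOR_RELEASE ?? config.meteorVersion;
  const checkout = flags['meteor-checkout'] ?? config.meteorCheckoutPath;
  const hasCheckout = Boolean(checkout) && checkoutUsable(checkout);

  if (version && hasCheckout) {
    throw new Error(`got version=${version} and checkout=${checkout}. Pick one.`);
  }
  if (version) {
    // The 'release:' prefix keeps sha a non-empty string while exposing the mode.
    return { mode: 'release', meteor: { version, sha: `release:${version}` } };
  }
  if (hasCheckout) return { mode: 'checkout', checkoutPath: checkout };
  return { mode: 'system' };
}

console.log(resolveMeteorSource({
  flags: { 'meteor-version': '3.1' },
  config: { meteorCheckoutPath: '/stale/default' },
  checkoutUsable: () => false, // a stale config default must not block the flag
}));
```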

Eliminates the 80-line app-lifecycle duplication between cmdRun and cmdScript in bench.js by moving install-deps, reset, start, wait-for, collectors, and stop into runner/. waitForApp swaps the legacy execSync('curl ...') + execSync('sleep 1') polling loop for native fetch + node:timers/promises, with a clear actionable error on timeout. findPid keeps its one tight try/catch: pgrep exits 1 on no match, and that is documented as an expected absence so callers can treat null as "skip this collector".

The runner sits behind runner/_io.js, a plain-object I/O facade that re-binds node:child_process / node:fs / node:timers/promises functions and a fetch wrapper as configurable properties. ESM namespace exports (including re-exports from node:*) are non-configurable, so tests hitting mock.method against them throw "Cannot redefine property"; the plain object dodges that without DI scaffolding or {exec, fs, spawn} function params. Production reads `io.execSync(...)` instead of `execSync(...)`: a three-character prefix in exchange for a fully mockable boundary.

Also closes the latent shell-injection surface QA flagged on commit 8: meteor reset / meteor run / npm install all switch from execSync(`${meteorCmd} subcommand`) template-literal shell calls to io.execFileSync(meteorCmd, argv) / io.spawn(meteorCmd, argv). No shell, no parsing of source.releaseArg or app paths as shell input, no possibility of injection from user-controlled inputs even if a future caller passes hostile data. cmdBundleSize's `du -sk` and `rm -rf` shell-outs remain; they are spec'd for commit 10's drivers/bundle-size.js conversion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
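A hedged sketch of the polling loop behind a plain-object facade. The `io` object here has only two illustrative keys, and the option names (`timeoutMs`, `intervalMs`) and error wording are assumptions rather than the runner's actual signature.

```javascript
const { setTimeout: sleep } = require('node:timers/promises');

// Plain-object facade: its properties are ordinary writable slots, so tests can swap them.
const io = {
  fetch: (...args) => fetch(...args),
  sleep,
};

// Hypothetical sketch: poll a URL until it answers 2xx or the deadline passes.
async function waitForApp(url, { timeoutMs = 60_000, intervalMs = 1_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await io.fetch(url);
      if (res.ok) return;
    } catch {
      // Connection refused while the app is still booting: fall through and retry.
    }
    await io.sleep(intervalMs);
  }
  throw new Error(`App at ${url} did not respond within ${timeoutMs}ms. Is it running?`);
}
```

In a test, `io.fetch` and `io.sleep` can be reassigned (or wrapped with `mock.method(io, 'fetch', ...)`) so no real network or delay is involved.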

bench.js shrinks from 547 lines to 93: now just parseArgs schema, command switch, and inline help. Every command body lives in a single-purpose cli/ module that calls into a driver and persists the result. Each driver owns one scenario kind (artillery, script, cold-start, bundle-size) and returns a buildResult-shaped object so the runtime contract stays uniform across all 10 scenarios.

drivers/bundle-size.js drops the last shell-outs: `du -sk` becomes a recursive io.statSync sum, `rm -rf` becomes io.rmSync({recursive:true, force:true}), and `meteor build` becomes io.execFileSync(argv) like the other meteor calls in commit 9. Zero template-literal shell calls remain in the new code; bench.js no longer imports execSync at all.

runner/_io.js extends to 15 keys: statSync + rmSync for bundle-size, SimpleDDP + ws for cli/dashboard.js. Per REFACTOR_SPEC.md hard-constraint meteor#4's approved exception, this is the canonical io facade for the whole codebase: drivers/ and cli/ reuse runner/_io.js rather than forking it. drivers/index.js mirrors the same plain-object dispatch pattern so cli/run.js can pick a driver via switch and tests can mock.method individual drivers.

cli/compare.js gains an actionable error path: missing or unparseable result files now exit 1 with the file path and a next-step hint ("Check the path or run 'bench.js run' first to produce it." / "Is the file a valid bench.js result?") instead of an unhandled exception. cli/run.js's unknown-scenario / unknown-app errors list the valid options rather than just naming what was wrong. Both are a small commit-12 preview that lands here because tests need to assert against them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
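The `du -sk` replacement can be sketched as a recursive size sum. This version calls `node:fs` directly rather than going through the io facade, and `dirSizeBytes` is a hypothetical name. Note that it sums apparent file sizes, whereas `du` reports allocated disk blocks, so the two can differ slightly.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Recursively sum file sizes in bytes. Symlinks are measured, not followed (lstat).
function dirSizeBytes(dir) {
  let total = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      total += dirSizeBytes(full);
    } else {
      total += fs.lstatSync(full).size;
    }
  }
  return total;
}

// Convert to whole kilobytes, rounding up, for a du-like figure.
function dirSizeKb(dir) {
  return Math.ceil(dirSizeBytes(dir) / 1024);
}
```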

Measures server-internal write-to-emit latency: time from Collection.insertAsync resolving to the moment Meteor's observer emits the corresponding `added` DDP message for EACH subscribed client. Surfaces as `metrics.live_update_propagation` (flat aggregate — not per-pub, not per-doc). This is THE metric that attributes wall-clock differences across observer drivers (changeStreams vs oplog vs polling) to actual propagation paths instead of CPU/RAM symptoms. Hooks (both prototype-level — Mongo.Collection and Session classes patched once so every instance and every future Collection picks the wrap up automatically): 1. Mongo.Collection.prototype.insertAsync → post-await, map.set(docId, Date.now()). 2. Session.prototype.sendAdded + sendChanged → if docId in map AND elapsed <= ATTRIBUTION_TTL_MS (10s), record sample. Why server-side Map keyed by docId, NOT the spec's __benchPushedAt in-doc field: - In-doc field pollutes Mongo schema permanently. - Initial-batch contamination: when a new sub connects, the observer's initial fetch fires sendAdded for ALL existing docs. Stale __benchPushedAt values would record ancient timestamps as "propagation". The Map + 10s TTL filters this out cleanly. This is REVISIONS.md task 03 spec-spirit but the implementation diverges from the prose: REVISIONS suggested __benchPushedAt; the gotcha was uncaught in review. Gated entirely on PROPAGATION_TIMING_OUTPUT — without the env var, init is a complete no-op and Mongo writes are never wrapped (zero overhead in dev). Same gate applied retroactively to method-timing and sub-timing for consistency: was always-on wrap with env-gated file output; now env-gated wrap + file output. Rule-of-three refactor: - apps/tasks-3.x/packages/bench-monitors/_dump-on-shutdown.js — extracted shared SIGTERM/SIGINT/beforeExit dump-once helper (~25 lines). Replaces 3× inline copies across method/sub/ propagation timing. 
Plumbing mirrors tasks 01 + 02: - bench-monitors package: new propagation-timing.server.js + re-export from bench-monitors.server.js. - server/main.js: initPropagationTiming() before registerTaskApi. - runner/meteor-process.js: PROPAGATION_TIMING_OUTPUT env passthrough. - runner/collectors.js: preparePropagationTimingOutput + aggregatePropagationTiming (flat-array variant) + read-on-stop with absence guard. - drivers/{artillery,script}.js: call prepare + pass path through start/stopCollectors. Tests: - tests/unit/propagation-timing-percentiles.test.js — 11 cases covering empty/null, single sample, 1000-sample percentiles, BARE-percentile contract, _ms-suffix contract, flat-shape lock (no per-pub/per-doc keys), zero-valued samples kept. - tests/unit/metric-keys-contract.test.js — extends ALLOWED set by one line (live_update_propagation). bench-monitors README: - Added propagation row to the monitors table. - "Known limitations" section: updates not wrapped (insertions only), TTL trade-off, polling driver dominated by interval. - "How a monitor works" rewritten around the env-gate-first pattern + reference to _dump-on-shutdown.js helper. - File-layout tree updated. Gates: 227/227 tests in ~727ms (+11 new); bench.js list/compare clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
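The Map-plus-TTL attribution described in this commit can be illustrated in isolation. `createPropagationRecorder` and its method names are hypothetical, and the clock is injectable purely so the TTL behaviour can be exercised without waiting. The real monitor patches Meteor prototypes, which this sketch does not attempt.

```javascript
const ATTRIBUTION_TTL_MS = 10_000;

// Hypothetical recorder: `now` is injectable so the TTL logic is testable.
function createPropagationRecorder(now = Date.now) {
  const writtenAt = new Map(); // docId -> timestamp taken after the insert resolved
  const samples = [];
  return {
    samples,
    onInsertResolved(docId) {
      writtenAt.set(docId, now());
    },
    // Called once per subscribed client that receives the doc.
    onSendAdded(docId) {
      const start = writtenAt.get(docId);
      if (start === undefined) return; // initial-batch doc, not written in this run
      const elapsed = now() - start;
      if (elapsed <= ATTRIBUTION_TTL_MS) samples.push(elapsed);
    },
  };
}

let clock = 1_000;
const recorder = createPropagationRecorder(() => clock);
recorder.onInsertResolved('doc-1');
clock += 40;
recorder.onSendAdded('doc-1');        // recorded: 40 ms
recorder.onSendAdded('pre-existing'); // ignored: never inserted during the run
clock += 20_000;
recorder.onSendAdded('doc-1');        // ignored: older than the TTL
console.log(recorder.samples);
```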

Two bugs surfaced during E2E validation of the Phase A metrics (commit ec7c1fa shipped 3 monitors but none of them actually wrote their dump files in a real benchmark run).

1) installDumpOnShutdown relied solely on SIGTERM/SIGINT/beforeExit handlers to flush samples. The meteor parent process kills its node child without reliably forwarding SIGTERM (collectors/gc-monitor.cjs calls this out at line 88, which is why gc-monitor uses a periodic snapshot). Our helper inherited the buggy pattern and silently lost ALL data: no file ever appeared. Result JSON had no ddp_methods / ddp_subscriptions / live_update_propagation keys despite the in-app collectors recording samples correctly.

Fix: write the output file every 5s via setInterval (unref'd so it doesn't block exit). Signal handlers stay as best-effort capture of the last 0-5s. Same shape gc-monitor.cjs uses.

2) propagation-timing.server.js: REVISIONS.md task 03 had wrong code. `Meteor.onConnection(conn => { const session = conn._session; ... })` always saw `conn._session === undefined`. The connection object in Meteor 3.x doesn't expose `_session`; the actual Session lives in `Meteor.server.sessions` (Map keyed by id), created AFTER the DDP `connect` message arrives (post-onConnection). Result: prototype never patched, sendAdded never recorded, samples array empty.

Fix: rewrote `tryPatchSessionProto()` to walk `Meteor.server.sessions` and grab the prototype off any live session (all share it). It is called lazily from two places: onConnection (deferred via setImmediate so Meteor has a tick to create the session) and inside the insertAsync wrap (covers scenarios where sub→insert ordering races against session creation). The call is idempotent on re-entry. Handles both Map and plain-object shapes of Meteor.server.sessions for forward/backward compat.

E2E validation on ddp-reactive-light (150 VUs, 30s): - ddp_methods: 6150 calls, 3 methods, p99 insertTask=0.93ms - ddp_subscriptions: 150 subs, p99 fetchTasks=3.05ms - live_update_propagation: 8177 observed updates, p50=1ms, p95=44ms, p99=52ms (observer=oplog) Result JSON now contains all three new metric keys end-to-end. Bench-monitors README updated to reflect the periodic-snapshot-first flush pattern. Unit tests unchanged: 227/227 in 635ms. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
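A minimal sketch of the periodic-snapshot-first flush pattern. `installPeriodicDump` and its return shape are hypothetical; the 5s interval and the `unref` call follow the commit text. Because installing a SIGTERM or SIGINT listener replaces Node's default termination, the signal handler in this sketch exits explicitly after flushing (an assumption about how the real helper behaves).

```javascript
const fs = require('node:fs');

// Hypothetical helper: periodic snapshot first, signal handlers as best effort.
function installPeriodicDump(outputPath, getSamples, intervalMs = 5_000) {
  const flush = () => {
    try {
      fs.writeFileSync(outputPath, JSON.stringify(getSamples()));
    } catch {
      // Never let metrics I/O crash the app under test.
    }
  };
  const timer = setInterval(flush, intervalMs);
  timer.unref(); // the timer alone must not keep the process alive

  // A custom signal listener disables Node's default exit, so exit explicitly.
  const onSignal = () => {
    flush();
    process.exit(0);
  };
  process.once('SIGTERM', onSignal);
  process.once('SIGINT', onSignal);
  process.once('beforeExit', flush);

  return {
    flush,
    stop() {
      clearInterval(timer);
      process.off('SIGTERM', onSignal);
      process.off('SIGINT', onSignal);
      process.off('beforeExit', flush);
    },
  };
}
```

If the parent kills the child without forwarding a signal, at most the last interval of samples is lost, since the file on disk is at most `intervalMs` old.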

Mongo opcounters: per-second rates for insert / query / update / delete / getmore / command, computed from serverStatus().opcounters delta over the benchmark window. Answers "different work or same work done differently?" when comparing observer drivers. Surfaces as `metrics.mongo_ops`:

{ metric: 'mongo_ops', duration_s: 35.27, ops_per_sec: { insert: 85, query: 176, delete: 89, getmore: 156, ... }, totals: { insert: 3000, query: 6213, delete: 3150, getmore: 5526, ... } }

Implementation (CC-6: mongodb npm driver, not the mongo shell, which the Mongo 7.0 dev_bundle dropped):
- Added `mongodb` as harness production dep.
- Exposed `MongoClient` on runner/_io.js (testable via mock.method).
- collectors/mongo-ops-monitor.js: standalone ESM script, spawned by the harness alongside process-monitor. Connects to target Mongo, reads serverStatus().opcounters at startup baseline and again on SIGTERM, dumps JSON to stdout. Same drain shape as process-monitor.
- runner/mongo-ops-rates.js: pure rate-math, extracted so it's unit-testable without spawn / MongoClient.
- runner/collectors.js: startCollectors gained `mongoUri` param, spawns mongo-ops-monitor when set.
- drivers/{artillery,script}.js: derive URI as `mongodb://127.0.0.1:${appPort + 1}` (Meteor's local Mongo port), overridable via `BENCH_MONGO_URL` for external Mongo (Galaxy etc.).
- Collector skips silently when Mongo isn't reachable (logs to stderr, exits 0 with no stdout → stopCollectors omits the key per absence convention CC-5).

Tests:
- tests/unit/mongo-ops-rates.test.js: 12 cases covering empty/zero activity, counter reset (end<start → treat delta=end), divide-by-zero on sub/zero/negative durations, new opcounter keys in future Mongo versions, null start, numeric coercion.
- metric-keys-contract.test.js: extends ALLOWED set by one line.

E2E validation on ddp-reactive-light (150 VUs, 33s): - inserts: 3000 total / 85 ops/sec (exact match — 150 VUs × 20 each) - deletes: 3150 total / 89 ops/sec (3000 removeTask + 150 removeAllTasks) - queries: 6213 / 176 ops/sec (observer reads, change-stream lookups) - getmore: 5526 / 156 ops/sec (change-stream cursor follow-ups) - oplog driver run; numbers will differ across changeStreams/polling. Dashboard panels (apps/dashboard/imports/ui/pages/detail.{html,js}): - DDP Methods — per-method count + avg/p95/p99/max (sorted by count) - DDP Subscriptions — per-publication count + avg/p95/p99/max - Live-update propagation — observed_updates + avg/p50/p95/p99/max - Mongo opcounters — per-op total + ops/sec (in serverStatus order) All panels guarded with `hasXxx` helpers — absent metrics → omitted cards (CC-5). Retro-fits the visualization gap for tasks 01-04. Gates: 239/239 tests in ~750ms (+12 new); bench.js list/compare clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
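The rate math can be sketched as a pure function over two opcounter snapshots. `computeOpRates` is a hypothetical name, and rounding rates to whole numbers is an assumption; the real module may round or truncate differently.

```javascript
// Hypothetical sketch: per-op deltas over the window, divided by its duration.
function computeOpRates(start, end, durationS) {
  const totals = {};
  const opsPerSec = {};
  for (const [op, endRaw] of Object.entries(end ?? {})) {
    const endCount = Number(endRaw);
    const startCount = Number(start?.[op] ?? 0);
    // A counter below its baseline means the counter was reset: count from zero.
    const delta = endCount < startCount ? endCount : endCount - startCount;
    totals[op] = delta;
    // Guard against zero or negative durations instead of emitting Infinity/NaN.
    opsPerSec[op] = durationS > 0 ? Math.round(delta / durationS) : 0;
  }
  return { metric: 'mongo_ops', duration_s: durationS, ops_per_sec: opsPerSec, totals };
}

console.log(computeOpRates({ insert: 100, query: 50 }, { insert: 3100, query: 6263 }, 35.27));
```

Iterating over the keys of `end` means opcounter keys added by future Mongo versions are picked up without code changes.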

Offline trend tool, NOT a new metric collector — it spawns nothing and adds no metrics.<key>. It reads the already-saved result history under config.results.history, keeps the runs with scenario === "bundle-size", sorts them by timestamp, and prints a Δ table so an operator can spot bundle bloat across runs without grep/jq dances. Flags: --limit N (recent runs, default 5), --format markdown|json (default markdown), --warn-kb N (⚠️ threshold on a positive delta, default 50). Markdown is the default human view (header + per-row delta, "-" for the first row, "+N KB ⚠️" once a jump hits the threshold); JSON emits { trend: [{ tag, client_js_kb, server_kb, total_kb, delta_kb }] } with delta_kb null on the first row for piping to jq. Forward-compatible with old/new result-JSON shapes: it only reads metrics.bundle_size (guarding on a numeric total_kb) plus the top-level scenario/tag/timestamp, so any other field a past or future harness version writes is ignored rather than fatal. A malformed or non-matching history file is skipped, never sinks the trend. Empty history prints a friendly "no runs found" line and exits 0. Split into pure helpers (loadBundleRuns/computeTrend/formatMarkdown/ formatJson) with the io facade injected into the loader, so the 16 new unit tests mock readdirSync/readFileSync instead of touching real disk. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
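A hedged sketch of the trend computation: filter to bundle-size runs with a numeric `total_kb`, sort by timestamp, keep the most recent N, and diff each row against its predecessor. This simplified `computeTrend` folds filtering into one function and assumes ISO-8601 timestamps; the `warn` field is illustrative.

```javascript
// Hypothetical sketch of the trend step described above.
function computeTrend(runs, { limit = 5, warnKb = 50 } = {}) {
  const recent = runs
    // Malformed or non-matching runs are skipped rather than treated as fatal.
    .filter((r) => r?.scenario === 'bundle-size' && typeof r.metrics?.bundle_size?.total_kb === 'number')
    .sort((a, b) => new Date(a.timestamp) - new Date(b.timestamp))
    .slice(-limit);

  return recent.map((run, i) => {
    const total = run.metrics.bundle_size.total_kb;
    // The first row has no predecessor, so its delta is null.
    const delta = i === 0 ? null : total - recent[i - 1].metrics.bundle_size.total_kb;
    return { tag: run.tag, total_kb: total, delta_kb: delta, warn: delta !== null && delta >= warnKb };
  });
}

const history = [
  { scenario: 'bundle-size', tag: 'c', timestamp: '2026-01-03T00:00:00Z', metrics: { bundle_size: { total_kb: 1100 } } },
  { scenario: 'bundle-size', tag: 'a', timestamp: '2026-01-01T00:00:00Z', metrics: { bundle_size: { total_kb: 1000 } } },
  { scenario: 'ddp-reactive-light', tag: 'x', timestamp: '2026-01-02T00:00:00Z', metrics: {} },
  { scenario: 'bundle-size', tag: 'b', timestamp: '2026-01-02T00:00:00Z', metrics: { bundle_size: { total_kb: 1020 } } },
];
console.log(computeTrend(history));
```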

Tasks 05 and 07 implemented in parallel via worktree agents. Due to a harness worktree-collision bug all three task-05/06/07 agents landed in the SAME worktree; agent 06 (the cleanest scope) committed alone, while agents 05 and 07 left their code uncommitted in that shared tree. Their work was complete and tested (278/278 in shared worktree) so this commit consolidates both metric collectors here in main, fills three small gaps the agents missed, and adds the dashboard panels. Task 05 — observer_pool: - apps/tasks-3.x/packages/bench-monitors/observer-pool-sampler.server.js Env-gated (OBSERVER_POOL_OUTPUT). setInterval (1000ms) reads MongoInternals.defaultRemoteCollectionDriver().mongo._observeMultiplexers (server-side path per REVISIONS — spec's Meteor.connection path is client-only). 10_000 sample cap. - runner/observer-pool-aggregator.js — pure {min,max,avg,end} math for both multiplexer_count and handle_count. Null on empty samples. - tests/unit/observer-pool-aggregator.test.js — 8 cases. Task 07 — ddp_messages: - apps/tasks-3.x/packages/bench-monitors/ddp-message-counter.server.js Env-gated (DDP_MESSAGE_OUTPUT). Hooks Session.prototype.send (outgoing) and Meteor.onMessage (incoming) per REVISIONS — the spec's Meteor.server._stream_server field doesn't exist. Reuses the Meteor.server.sessions lazy proto-lookup pattern from propagation-timing.server.js (conn._session is undefined in Meteor 3.x onConnection, so we walk Meteor.server.sessions after setImmediate). High message volume → CC-8 SIGTERM dump-file. - runner/message-rate-aggregator.js — pure totalIn/totalOut + in_per_sec/out_per_sec + by_type. Null on zero-zero totals. - tests/unit/message-rate-aggregator.test.js — 10 cases. Gaps filled (not done by agents — they ran inside a shared worktree and couldn't validate the full plumbing): - drivers/{artillery,script}.js had observerPoolPath wired but NOT ddpMessagePath. Added prepareDdpMessageOutput + passthrough. 
- tests/unit/metric-keys-contract.test.js: appended observer_pool and ddp_messages to ALLOWED_METRIC_KEYS.
- apps/tasks-3.x/packages/bench-monitors/README.md: added rows for both new monitors to the Current monitors table and the file-layout tree.

Dashboard panels (apps/dashboard/imports/ui/pages/detail.{html,js}):
- Observer pool: Multiplexers/Handles min/max/avg/end table.
- DDP messages: totals + rates summary + by_type breakdown (merged from in/out maps, sorted by combined count desc).

Both guarded with hasXxx helpers (CC-5 absence → omitted cards).

E2E validation on ddp-reactive-light (150 VUs, 33s, oplog driver):
- 9 metric keys in result JSON: app_resources, db_resources, gc, mongo_ops, ddp_methods, ddp_subscriptions, live_update_propagation, observer_pool, ddp_messages
- DDP messages: 6300 in / 25835 out (179/s · 738/s), breakdown {in: method:6150, sub:150} / {out: added:7719, result:6150, updated:6150, removed:5517, ready:150, connected:149}
- Observer pool: 34 samples, max muxes=1, max handles=1 (deduped cursor, but VU avg-session ~100ms so most samples catch idle state)

Gates: 278/278 unit tests in ~750ms; bench.js list + compare clean.

Task 06 (bundle-delta) was cherry-picked separately as commit abc888c.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Standalone Mongo collector (mirrors task 04's mongo-ops-monitor) that profiles slow ops during the benchmark window and surfaces them under metrics.mongo_slow_queries: total_slow, by_op breakdown, slowest_ms, and a sanitized slowest_sample (ns/op/filter_keys/millis/planSummary). Profile-on-demand pattern: on startup the collector connects to the app DB (default "meteor", overridable via BENCH_MONGO_DB), enables the profiler at slowms=100, and records a benchmarkStart timestamp; on SIGTERM it reads the captured slow ops, aggregates, restores the original profiler config, and emits JSON on stdout. Aggregation lives in a pure, unit-tested module (runner/slow-query-aggregator.js) and runs inside the collector, so the result flows through stopCollectors' generic stdout-JSON drain — no new read block in collectors.js. Three REVISIONS.md (task 12) fixes vs the original spec: 1. Full profiler-config capture+restore. slowms is a sticky GLOBAL Mongo setting that {profile:0} does NOT reset, so we capture {was, slowms, sampleRate} via {profile:-1} and restore all three on shutdown — otherwise every run leaks the harness's slowms=100 into the developer's Mongo. Restore happens before the stdout write so the DB is left clean even if aggregation throws. 2. The query predicate is read from command.filter (not a top-level field). 3. Timestamp-window read of system.profile (ts >= benchmarkStart) instead of the destructive system.profile.drop(). PII safety: slowest_sample.filter_keys carries only filter KEY NAMES, never values (the profile doc holds the full predicate, sensitive on prod data). Absence convention CC-5: empty profile → aggregator returns null → collector writes nothing → key omitted. Init/error paths exit 0 with no stdout. Plumbing: MONGO_SLOW_QUERY_MONITOR const + spawnMongoSlowQueryMonitor + startCollectors push (alongside mongo-ops, gated on mongoUri) in runner/collectors.js. mongoUri already passes through both drivers. 
One-line extension to the metric-keys contract test. Tests: 15 new aggregator cases (empty/null, mixed op breakdown, slowest by millis, filter_keys sanitization + no value leak, deterministic tie-break, threshold/duration passthrough, missing filter, unknown op, large array, shape contract). Suite 278 -> 293, all green; bench.js list + compare exit 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
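A minimal sketch of the aggregation and the key-names-only sanitization described above. It is not the code in runner/slow-query-aggregator.js: the function name, the "first seen wins" tie-break and the sorted filter_keys are assumptions made for illustration. The input is assumed to be an array of system.profile documents already restricted to the benchmark window.

```javascript
// Hypothetical sketch; field names follow the commit message, everything else is assumed.
function aggregateSlowQueries(profileDocs) {
  // Absence convention: empty profile means null, so the key is omitted.
  if (!Array.isArray(profileDocs) || profileDocs.length === 0) return null;

  const byOp = {};
  let slowest = null;
  for (const doc of profileDocs) {
    const op = doc.op || 'unknown';
    byOp[op] = (byOp[op] || 0) + 1;
    const millis = Number(doc.millis) || 0;
    // Strict ">" keeps the first document on ties (assumed tie-break rule).
    if (slowest === null || millis > slowest.millis) slowest = { doc, millis };
  }

  // The predicate lives under command.filter; only its KEY NAMES are kept.
  const filter = (slowest.doc.command && slowest.doc.command.filter) || {};
  return {
    metric: 'mongo_slow_queries',
    total_slow: profileDocs.length,
    by_op: byOp,
    slowest_ms: slowest.millis,
    slowest_sample: {
      ns: slowest.doc.ns,
      op: slowest.doc.op || 'unknown',
      filter_keys: Object.keys(filter).sort(),
      millis: slowest.millis,
      planSummary: slowest.doc.planSummary,
    },
  };
}
```

Because the sample is rebuilt field by field instead of copying the profile document, predicate values never reach the result JSON.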

Standalone out-of-process Mongo collector, same delta-vs-baseline pattern as task 04's mongo_ops. On startup it discovers the app's user collections via db.listCollections() (excluding system.* and any non-alphanumeric-leading internal namespaces) and snapshots each collection's per-index $indexStats — accesses.ops + since. On SIGTERM it re-snapshots, diffs per index, and emits metrics.mongo_index_usage to stdout, which stopCollectors' generic JSON drain already ingests (no special handling). Output shape: { metric: 'mongo_index_usage', collections: { <name>: [{ name, ops_in_window, since, key }] } }. ops_in_window is end.ops − start.ops (what THIS run hit, not lifetime); since is normalized to an ISO string. Normalization mirrors mongo_ops: an index created/first- tracked mid-run (no baseline) counts its full end value; a counter reset (end < start, server restart) uses end; an index dropped mid-run falls out (only end rows are iterated). Never-used indexes (ops_in_window 0) are KEPT so dead indexes that cost write-amplification stay visible. Absence convention CC-5: init failure, no user collections, or zero index rows → no stdout, so the key is omitted and other metrics are unaffected. CC-6: the mongodb driver comes through runner/_io.js (io.MongoClient) for mockability. The diff math is extracted into a pure runner/index-usage-aggregator.js (aggregateIndexUsage) covered by 13 unit tests; total suite 278 → 291. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
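The per-index diff rules above (no baseline counts the full end value, a counter reset uses end, dropped indexes fall out, zero-usage indexes are kept) can be sketched as follows. This is a hedged illustration, not the real aggregateIndexUsage: the snapshot shape ({ collection: [{ name, ops, since, key }] }) and the assumption that ops arrives as a plain number are mine.

```javascript
// Hypothetical sketch of the start/end diff; snapshot shape is an assumption.
function aggregateIndexUsage(start, end) {
  const collections = {};
  // Only END rows are iterated, so an index dropped mid-run simply falls out.
  for (const [coll, endRows] of Object.entries(end || {})) {
    const baseline = new Map(((start && start[coll]) || []).map((r) => [r.name, r.ops]));
    const rows = endRows.map((row) => {
      const startOps = baseline.get(row.name);
      // No baseline (created mid-run) or counter reset (end < start): use end as-is.
      const opsInWindow =
        startOps === undefined || row.ops < startOps ? row.ops : row.ops - startOps;
      return {
        name: row.name,
        ops_in_window: opsInWindow, // zero-usage indexes are deliberately kept
        since: new Date(row.since).toISOString(),
        key: row.key,
      };
    });
    if (rows.length > 0) collections[coll] = rows;
  }
  // Absence convention: zero index rows means null, so the key is omitted.
  if (Object.keys(collections).length === 0) return null;
  return { metric: 'mongo_index_usage', collections };
}
```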

Standalone Mongo connection-pool sampler that polls serverStatus().connections every second over the benchmark window (a time-series sampler, NOT a start/end delta — connection counts rise and fall as VUs connect/disconnect) and surfaces as metrics.mongo_pool. Mirrors the task 04 mongo_ops collector: spawned out-of-process by startCollectors, emits aggregated JSON on stdout on SIGTERM, flows through the generic JSON drain. REVISIONS.md task 14 fixes, re-verified live against the dev_bundle Mongo 7.0.16 (serverStatus().connections probe): - connections.totalClosed does NOT exist in Mongo 7.0 — dropped total_closed from the output shape. - connections.available is server-side LISTENER HEADROOM (~800k max incoming slots), NOT idle pool connections — dropped; it's not a useful saturation signal. Saturation shows as current ≈ active, so we sample `active` instead and compare current vs active. Output shape: current + active as time-series (min/max/avg/end), total_created as start/end/delta (monotonic counter, only the window delta is meaningful). Aggregation (min/max/avg/end + counter delta) is a pure module (runner/connection-pool-aggregator.js) so it's unit-tested without MongoClient or a child process; returns null when no samples were captured (CC-5 absence → caller omits the key). 13 new unit tests. Files: collectors/mongo-pool-monitor.js (new), runner/connection-pool-aggregator.js (new), tests/unit/connection-pool-aggregator.test.js (new), runner/collectors.js (+spawnMongoPoolMonitor + startCollectors push), tests/unit/metric-keys-contract.test.js (+mongo_pool). Gates: npm test 291/291 (was 278), bench.js list clean, bench.js compare exit 0. Smoke-tested live against mongod:4001. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…mongo_pool Renders the 3 new Mongo metrics (tasks 12/13/14) from the result JSON. Layout: - Mongo connection pool — current/active min/max/avg/end + total_created start/end/delta (REVISIONS-corrected shape, no totalClosed, saturation = current vs active). - Mongo slow queries — by_op count table + slowest sample card (ns + op + ms + filter_keys + planSummary). Filter values are redacted at the collector layer (PII safety per task 12 spec); only key names are surfaced. - Mongo index usage — one full-width card per collection (one row per index) with name, key, ops_in_window, tracked-since timestamp. Wide layout because $indexStats rows have a key spec that needs real estate. Each panel guarded with `hasXxx` helpers; absent metrics → omitted cards (CC-5). Notably mongo_slow_queries IS often absent — the default 100ms slowms threshold means a clean run reports nothing, which is the correct signal. E2E validation on ddp-reactive-light (150 VUs, 33s, oplog driver): - 11 metric keys in result JSON: app_resources, db_resources, gc, mongo_ops, ddp_methods, ddp_subscriptions, live_update_propagation, observer_pool, ddp_messages, mongo_index_usage, mongo_pool. - mongo_slow_queries correctly absent (nothing >100ms in this workload). - mongo_pool: 39 samples — current 18→27 avg 24.4, active 5→10 avg 9.7, total_created delta=9 connections. - mongo_index_usage: taskCollection._id_ index → 3000 ops in window. Gates: 319/319 tests in ~735ms (+41 from the 3 new aggregator suites); bench.js list + compare clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

New bench-monitor that measures the BYTE SIZE of every DDP message in each direction, surfaced as metrics.ddp_frame_size: in/out size distributions (count + avg/p50/p95/p99/max) plus a per-type byte-sum breakdown. Complementary to task 07's ddp_messages, which COUNTS messages — this measures how big they are, so two configs with identical message counts but different byte volumes (partial- vs full-doc, sockjs vs uws framing) are distinguishable. Separate metric key, not an extension of ddp-message-counter. Per-message bytes via Buffer.byteLength(JSON.stringify(msg), 'utf8') inside the Session.prototype.send wrap (outgoing) and the Meteor.onMessage callback (incoming). REVISIONS.md task 08: the hooks see the structured msg object PRE-serialization, so the serialized DDP JSON length is the canonical wire size (pre-compression; task 09 will cover post-compression separately). Same lazy Session-prototype grab as ddp-message-counter (re-implemented, not cross-imported — monitors stay independent). Field naming (CC-4): byte percentiles carry a _bytes SUFFIX (p50_bytes/p95_bytes/p99_bytes, avg_bytes, max_bytes) rather than the bare p50/p95/p99 form — bare exists to match the shipped event_loop_delay contract whose percentiles are in ms; these are bytes, so the unit suffix is required. Percentiles come from the shared lib/percentiles.js summarize (CC-1). CC-8 high-volume: accumulate in memory + flush via installDumpOnShutdown (raw size arrays + per-type byte sums); the harness-side aggregator (runner/frame-size-aggregator.js) computes percentiles. Per-direction sample arrays cap at 200k (bounds memory on >1M-message runs per the spec's sampling note); per-type byte sums keep accumulating past the cap so byte accounting stays complete. Absence (CC-5): no messages either direction → aggregator returns null → key omitted. Gated entirely on DDP_FRAME_SIZE_OUTPUT. 
Plumbing: re-export from bench-monitors.server.js; call in server/main.js Meteor.startup after initDdpMessageCounter; DDP_FRAME_SIZE_OUTPUT passthrough in runner/meteor-process.js; prepareFrameSizeOutput + read-aggregate-unlink block in runner/collectors.js; wired through both drivers. One-line contract test extension; README monitors table + file-tree updated. Tests: 13 new aggregator cases (null/empty→null, single value, known [1..100] nearest-rank percentiles, both directions, float avg rounding, by_type passthrough + copy-not-reference, _bytes-suffix shape contract, large array, in-only/out-only). Suite 319 → 332, all green; bench.js list + compare exit 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
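To make the byte measurement and the `_bytes`-suffixed summary concrete, here is a small sketch. `Buffer.byteLength(JSON.stringify(msg), 'utf8')` is the call named in the commit message; the helper names and the nearest-rank percentile helper are stand-ins I made up for illustration, not the shared lib/percentiles.js summarize.

```javascript
// Pre-compression wire size of one DDP message, as described above.
function frameBytes(msg) {
  return Buffer.byteLength(JSON.stringify(msg), 'utf8');
}

// Hypothetical nearest-rank percentile over an ascending-sorted array.
function nearestRank(sorted, p) {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical per-direction summary using the _bytes suffix convention.
function summarizeBytes(sizes) {
  if (!Array.isArray(sizes) || sizes.length === 0) return null;
  const sorted = [...sizes].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, n) => acc + n, 0);
  return {
    count: sorted.length,
    avg_bytes: Math.round((sum / sorted.length) * 100) / 100,
    p50_bytes: nearestRank(sorted, 50),
    p95_bytes: nearestRank(sorted, 95),
    p99_bytes: nearestRank(sorted, 99),
    max_bytes: sorted[sorted.length - 1],
  };
}
```

Measuring bytes rather than string length matters for non-ASCII payloads: a two-byte UTF-8 character counts once in `.length` but twice on the wire.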

Standalone Mongo collector that polls currentOp every 250ms for in-flight change-stream getMore cursors and surfaces as metrics.mongo_changestream — a time-series cursor_count (min/max/avg/end) plus a per-namespace breakdown (max + avg per ns). Mirrors the task 14 mongo_pool collector: spawned out-of-process by startCollectors when mongoUri is set, emits aggregated JSON on stdout on SIGTERM, flows through the generic JSON drain. Uses the CC-6 mongodb npm driver via io.MongoClient. Aggregation is a pure module (changestream-aggregator.js) so it's unit-tested without MongoClient — 13 new tests. currentOp filter VERIFIED LIVE against the dev_bundle Mongo 7.0.16 (single-node RS) by opening real change streams and watching currentOp. The REVISIONS.md task 24 filter does NOT work as written — two corrections, both confirmed live: 1. Field path is cursor.originatingCommand.pipeline, WITH the `cursor.` prefix. originatingCommand sits under the `cursor` sub-doc of each inprog entry, not top level. (REVISIONS prose says "cursor.originatingCommand.pipeline" but its code block dropped the prefix; the top-level path matches 0, the cursor-prefixed path matches every change stream.) 2. $elemMatch: { $changeStream: { $exists: true } } THROWS "unknown operator: $changeStream" (Mongo parses $changeStream as a query operator inside $elemMatch). The dotted-path form { 'cursor.originatingCommand.pipeline.$changeStream': { $exists: true } } does not throw and matches correctly. Per REVISIONS: 250ms sampling (1 Hz undercounts — getMores complete sub-second), NO idleCursors flag (change-stream getMores are active awaitData ops; idleCursors:true returns 0). Per-namespace keys off op.ns (stable real id, CC-7). Absence (CC-5): exits 0 with no stdout on init error → key omitted; under the oplog driver every sample is just 0. 
Smoke-tested live: ran the collector against mongod:4001 while holding 3 change streams open (2 on meteor.tasks, 1 on meteor.widgets) → 10 samples, cursor_count max=3, by_namespace {tasks:{max:2}, widgets: {max:1}}, exit 0. Files: collectors/mongo-changestream-monitor.js (new), runner/changestream-aggregator.js (new), tests/unit/changestream-aggregator.test.js (new), runner/collectors.js (+spawnMongoChangestreamMonitor + push), tests/unit/metric-keys-contract.test.js (+mongo_changestream). Gates: npm test 332/332 (was 319), bench.js list clean, bench.js compare exit 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
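The corrected filter and the per-namespace roll-up can be sketched like this. The filter object is the dotted-path form quoted above; everything else (function names, the input being one array of matching currentOp `inprog` entries per 250ms tick, and averaging each namespace over all ticks) is an assumption for illustration, not the code in runner/changestream-aggregator.js.

```javascript
// Dotted-path filter from the commit message (the $elemMatch form throws).
const CHANGE_STREAM_FILTER = {
  'cursor.originatingCommand.pipeline.$changeStream': { $exists: true },
};

// Count in-flight change-stream ops per namespace for one tick.
function countByNamespace(inprog) {
  const counts = {};
  for (const op of inprog) counts[op.ns] = (counts[op.ns] || 0) + 1;
  return counts;
}

function summarize(values) {
  const sum = values.reduce((a, b) => a + b, 0);
  return {
    min: Math.min(...values),
    max: Math.max(...values),
    avg: sum / values.length,
    end: values[values.length - 1],
  };
}

// Hypothetical aggregator: `samples` is an array of per-tick inprog arrays.
function aggregateChangeStreams(samples) {
  if (!Array.isArray(samples) || samples.length === 0) return null;
  const perTick = samples.map(countByNamespace);
  const totals = perTick.map((c) => Object.values(c).reduce((a, b) => a + b, 0));
  const acc = {};
  for (const counts of perTick) {
    for (const [ns, n] of Object.entries(counts)) {
      const entry = acc[ns] || (acc[ns] = { max: 0, sum: 0 });
      entry.max = Math.max(entry.max, n);
      entry.sum += n;
    }
  }
  const byNamespace = {};
  for (const [ns, { max, sum }] of Object.entries(acc)) {
    // Assumption: ticks where the namespace had no cursor count as 0 in the average.
    byNamespace[ns] = { max, avg: sum / samples.length };
  }
  return {
    metric: 'mongo_changestream',
    samples: samples.length,
    cursor_count: summarize(totals),
    by_namespace: byNamespace,
  };
}
```

Under the oplog driver every tick is an empty array, so the result is a series of zeros rather than null, matching the "every sample is just 0" note.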

Standalone out-of-process Mongo collector, same start+end delta pattern as task 04's mongo_ops. On startup it snapshots serverStatus().wiredTiger.cache; on SIGTERM it re-snapshots, computes the per-window page-count deltas and the cache hit ratio, captures the end-of-run bytes-in-cache gauge, and emits metrics.mongo_wiredtiger to stdout — which stopCollectors' generic JSON drain already ingests (no special handling). cache_hit_ratio = (pages_requested − pages_read_in) / pages_requested, computed on the DELTAS over the benchmark window (what THIS run hit), not lifetime counters. The four WiredTiger fields are read by their exact human-readable string names ("pages requested from the cache", "pages read into cache", "pages written from the cache", "bytes currently in the cache") — validated against live Mongo 7.0 per REVISIONS, no renames. serverStatus is server-wide so the admin DB handle is used and the ratio reflects the whole mongod's global cache (fine for a Meteor-dominated workload). Normalization mirrors mongo_ops: a counter whose end < start (server restart mid-run) uses the end value; read_in is clamped so the ratio stays in [0, 1]. Absence convention CC-5: init failure, a non-WiredTiger storage engine (wiredTiger sub-doc absent), or zero cache traffic (requested delta 0, which also covers two identical snapshots) → no stdout, so the key is omitted and other metrics are unaffected. CC-6: the mongodb driver comes through runner/_io.js (io.MongoClient) for mockability. The hit-ratio math is extracted into a pure runner/wiredtiger-aggregator.js (aggregateWiredTiger) covered by 12 unit tests; total suite 319 → 331. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
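The hit-ratio formula and its normalization rules can be written out as a short sketch. The two counter names are the exact WiredTiger strings quoted above; the function names and the choice to return a bare number (rather than the full metrics.mongo_wiredtiger object) are assumptions for illustration.

```javascript
// Exact serverStatus().wiredTiger.cache field names, as quoted above.
const REQUESTED = 'pages requested from the cache';
const READ_IN = 'pages read into cache';

// Window delta with the restart rule: if the counter went backwards, use end.
function windowDelta(start, end) {
  return end < start ? end : end - start;
}

// Hypothetical helper: hit ratio over the benchmark window, or null when
// there was no cache traffic (which also covers two identical snapshots).
function cacheHitRatio(startCache, endCache) {
  const requested = windowDelta(startCache[REQUESTED], endCache[REQUESTED]);
  if (!(requested > 0)) return null;
  // Clamp read_in to the requested delta so the ratio stays within [0, 1].
  const readIn = Math.min(windowDelta(startCache[READ_IN], endCache[READ_IN]), requested);
  return (requested - readIn) / requested;
}
```

For example, 1000 pages requested with 250 read into cache during the window gives (1000 − 250) / 1000 = 0.75.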

Coarse DDP compression ratio per the task 09 spec (the precise per-msg variant needs meteor-source changes that are out of scope; this ships the workable approximation). The metric pairs: - pre-compression bytes: JSON.stringify lengths from the frame-size monitor's existing dump (task 08). - post-compression bytes: socket.bytesRead/bytesWritten deltas across all tracked connections, captured by a new in-app monitor. The harness aggregator (runner/compression-aggregator.js) consumes BOTH dumps in stopCollectors — frame-size is read first and its raw parsed dump is kept in scope for the compression read that follows. If either dump is absent the metric is omitted (CC-5). Output shape (`metrics.ddp_compression`): { out: { uncompressed_bytes, compressed_bytes, ratio, savings_pct }, in: { uncompressed_bytes, compressed_bytes, ratio, savings_pct }, } Ratios > 1 (WS framing overhead inflating tiny-msg traffic) pass through honestly — not clamped. Per-direction ratio is null when uncompressed bytes in that direction are 0 (divide-by-zero guard). Implementation notes: - New compression-tracker.server.js: Meteor.onConnection registers per-conn baseline socket bytes (resolved via 6 candidate paths through the Meteor internals; falls back gracefully + warns once if none match the running Meteor). conn.onClose captures final bytes; the SIGTERM dump also sums live-connection deltas so counts stay self-consistent across the periodic snapshot writes. - All standard plumbing: env-gated init, installDumpOnShutdown, bench-monitors.server.js re-export, server/main.js init call, DDP_COMPRESSION_OUTPUT env passthrough, prepareCompressionOutput in collectors.js, driver wiring, README row + tree, contract test +1 line. Tests: tests/unit/compression-aggregator.test.js (10 cases — both dumps required, all-zero null, typical ratio + savings, divide-by-zero, ratio>1 honest passthrough, 4-decimal rounding, non-numeric coercion, shape lock). 
Gates: 368/368 (+11 from compression aggregator suite); bench.js list/ compare clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per-cursor observer driver fallback tracker. The existing startup probe in apps/tasks-3.x/server/main.js measures the driver Meteor picks for ONE throwaway cursor. This metric measures the driver Meteor actually selects for EVERY cursor opened during the run — so a benchmark labeled `oplog` doesn't quietly run 30% polling due to per-cursor fall-through. Surfaces as `metrics.driver_fallbacks`: { metric: 'driver_fallbacks', total_cursors, no_fallback, configured_first, fallbacks: { 'changeStreams_to_oplog': N, 'oplog_to_polling': M, ... } } REVISIONS.md task 10: do NOT wrap `_selectReactivityDriver` — the polling fallback path bypasses it (happens at mongo_connection.js:1188). Instead wrap the connection-instance `_observeChanges` and read the selected driver off `handle._multiplexer._observeDriver.constructor.name` (same internal the startup probe uses). Instance-level wrap (not prototype) is safer here — Meteor normally has one default mongo connection; wrapping the instance avoids any chance of double-wrap from other code touching the prototype. Standard plumbing (env-gated init, installDumpOnShutdown, bench-monitors re-export, server/main.js init, DRIVER_FALLBACK_OUTPUT env passthrough, prepare + spawn wiring in collectors.js, driver wiring, README row, contract test +1 line). Tests: tests/unit/driver-fallback-aggregator.test.js (8 cases — null on total=0, pass-through shape, multiple transitions preserved, missing configured_first → null, non-numeric coercion, defensive copy of fallbacks object, key lock). Gates: 376/376 (+8 from driver-fallback aggregator suite); bench.js list/compare clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Renders the 5 new metrics added this round in detail.html + detail.js: - DDP frame size — per-direction count + avg/p50/p95/p99/max bytes - DDP compression (coarse) — per-direction uncompressed/compressed bytes + ratio + savings_pct - Mongo change-stream cursors — time-series total + per-namespace - Mongo WiredTiger cache — hit ratio + 4 page counters + bytes_in_cache - Driver fallbacks — total observe()s + no_fallback + per-transition counts (e.g. changeStreams_to_oplog: 150) Layout: 3 new rows of 2 cards each, plus driver_fallbacks alone (it's the standout signal — putting it on its own row makes the per-transition breakdown legible). Each panel guarded with hasXxx; absent metrics → omitted cards (CC-5). ALSO ships a small aggregator fix for compression: when uncompressed > 0 but compressed = 0, emit null ratio/savings_pct rather than the nonsense "100% savings" value. That zero-compressed case is the signature of a failed socket-byte capture (compression-tracker.server.js's findRawSocket couldn't resolve the underlying TCP socket on this Meteor/transport combination). Documented honestly via null instead of a misleading number; +1 test case for the detection. E2E validation on ddp-reactive-light (150 VUs, 33s, oplog driver): - 16 metric keys in result JSON (was 11 before this round) - ddp_frame_size: in 132B avg / out 108B avg; by_type_bytes shows `added` dominates outgoing (~2 MB of ~3 MB total) - ddp_compression: emitted with null ratio (socket capture unresolved — known limitation, not silently 100% anymore) - mongo_changestream: 157 samples; max=0 (oplog driver in use, no change-stream cursors live — correct) - mongo_wiredtiger: 99.86% hit ratio · 91k page requests · 132 reads into cache · 28 MB in cache (healthy) - driver_fallbacks: 150/150 cursors fell back changeStreams→oplog (matches runtime-info probe; configured first never used) Gates: 377/377 (+1 from new compression test); bench.js list/compare clean. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
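A rough sketch of the per-direction compression math, including the null guards described above. It is not runner/compression-aggregator.js. I am assuming that ratio means compressed / uncompressed (which is consistent with "ratios > 1" indicating framing overhead and with the old "100% savings" symptom), that savings_pct is (1 − ratio) × 100, and that the dumps expose hypothetical `out_bytes` / `in_bytes` fields.

```javascript
// Round to 4 decimals, as the test list above mentions.
function round4(x) {
  return Math.round(x * 10000) / 10000;
}

// Hypothetical per-direction calculation.
function directionRatio(uncompressed, compressed) {
  const u = Number(uncompressed) || 0; // non-numeric input coerces to 0
  const c = Number(compressed) || 0;
  // u === 0: divide-by-zero guard. c === 0 with u > 0: failed socket-byte
  // capture, reported as null instead of a misleading "100% savings".
  const measurable = u > 0 && c > 0;
  return {
    uncompressed_bytes: u,
    compressed_bytes: c,
    ratio: measurable ? round4(c / u) : null, // > 1 passes through unclamped
    savings_pct: measurable ? round4((1 - c / u) * 100) : null,
  };
}

// Both dumps are required; dump field names here are assumptions.
function aggregateCompression(frameDump, socketDump) {
  if (!frameDump || !socketDump) return null;
  const out = directionRatio(frameDump.out_bytes, socketDump.out_bytes);
  const inbound = directionRatio(frameDump.in_bytes, socketDump.in_bytes);
  const allZero = [out, inbound].every(
    (d) => d.uncompressed_bytes === 0 && d.compressed_bytes === 0
  );
  if (allZero) return null;
  return { metric: 'ddp_compression', out, in: inbound };
}
```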

New build-profile driver runs ONE production build with METEOR_PROFILE=1, parses the profile tree once, and emits TWO metrics from it — build_profile (top-N hottest build nodes by self_ms + long-tail roll-up) and plugin_compile (per-compiler-plugin time). One build serves both, avoiding a second 60-180s build per benchmark. Spec redesigned around METEOR_PROFILE=1 because `meteor build --verbose` emits zero timing. Exploration of a REAL build (tasks-3.x) corrected several brief assumptions, documented in the parser header and captured as a regression fixture (tests/unit/fixtures/meteor-profile-sample.txt): - Profile goes to STDOUT, not stderr. - Each line is `| <indent><name><pad><N> ms[ (<count>)]`. Depth is encoded by 3-column box-drawing groups (│ / ├─ / └─ / spaces); dot-leader vs space padding is cosmetic (has-children), not depth. ms/count carry thousands commas; count is optional on synthetic "other X" lines. - Plugin nodes are top-level entries literally named `plugin <name>` (plugin ecmascript / typescript / static-html / meteor verified live). - total_ms is read from the authoritative `(meteor#1) Total: N ms` line, NOT a tree sum (tree timings nest/overlap → summing double-counts). - The trailing "Top leaves:" block duplicates tree entries and is skipped so it can't double-rank the top-N. Files: runner/meteor-profile-parser.js (pure, defensive — unmatched lines skipped, truncated output parses partially, never throws); runner/build-profile-aggregator.js (rank by self_ms, top-N default 5, children_ms = descendant self_ms sum for top nodes, long_tail_ms = total - top_n_total clamped at 0, null on empty); runner/plugin-compile- aggregator.js (filter `plugin ` prefix, group by stable name, sum on recurrence, null when none). drivers/build-profile.js captures stdout (32MB maxBuffer), resets first for a clean compile, parses partial output on nonzero exit, pushes only non-null aggregates (CC-5). 
Wired via drivers/index.js, cli/run.js pickDriver (driver: 'build-profile'), bench.config.js scenario. Two contract-test keys under a new Phase D header. Output shapes: build_profile = { metric, total_ms, top_nodes:[{name,self_ms,children_ms, count}], top_n_count, top_n_total_ms, long_tail_ms } plugin_compile = { metric, total_plugin_ms, plugins:{<name>:{self_ms,count}} } Tests: 36 new (15 parser incl. real-fixture assertion, 12 build-profile-agg, 9 plugin-compile-agg). Suite 377 -> 413, all green; bench.js list shows the build-profile scenario; bench.js compare exit 0. E2E left to the consolidated run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
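The ranking and long-tail rules for build_profile can be sketched as below. This is a simplified illustration, not runner/build-profile-aggregator.js: it takes an already-flattened list of { name, self_ms, count } nodes plus the authoritative total, leaves out children_ms (which needs the parsed tree), and the name-based tie-break is my assumption.

```javascript
// Hypothetical sketch: rank by self_ms, keep top-N, roll the rest into a long tail.
function aggregateBuildProfile(nodes, totalMs, topN = 5) {
  // Absence convention: nothing parsed means null, so the key is omitted.
  if (!Array.isArray(nodes) || nodes.length === 0) return null;

  const ranked = [...nodes].sort((a, b) => {
    if (b.self_ms !== a.self_ms) return b.self_ms - a.self_ms;
    // Assumed deterministic tie-break: plain string comparison on name.
    return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
  });
  const top = ranked.slice(0, topN);
  const topTotal = top.reduce((sum, n) => sum + n.self_ms, 0);

  return {
    metric: 'build_profile',
    total_ms: totalMs, // read from the "Total: N ms" line, not summed from the tree
    top_nodes: top.map(({ name, self_ms, count }) => ({ name, self_ms, count })),
    top_n_count: top.length,
    top_n_total_ms: topTotal,
    long_tail_ms: Math.max(0, totalMs - topTotal), // clamped at 0
  };
}
```

The clamp matters because nested timings can make the top-N sum exceed the reported total.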

Renders the 2 new build-time metrics from the Phase D round in detail.html + detail.js. Both are full-width single-column cards (below the DDP/Mongo panels) since the table rows can be wide:
- Build profile (top hot nodes): table of {name, self_ms, children_ms, count} for the top-N hot nodes from the METEOR_PROFILE=1 tree. Footer shows total / top-N total / long tail breakdown.
- Per-compiler-plugin time: table of {plugin, self_ms, count} sorted by self_ms desc. Footer shows total + plugin count.
Each panel is guarded with hasXxx; absent metrics → omitted cards (CC-5). Build-time metrics only appear on `build-profile` scenario runs (not DDP scenarios), so the cards naturally hide on the runs that don't emit them.

ALSO marks task 22 (hot-reload) as deferred in .claude/metrics-tasks/README.md. The first parallel attempt stalled mid-implementation (4 REVISIONS fixes: Playwright install + chromium, hmr-probe module, console marker listener, SIGINT cleanup handler). The agent worktree and branch were removed cleanly; no partial work is preserved in git history because no commit landed. The task can be re-attempted with fresh context later.

E2E validation on ddp-reactive-light's sibling `build-profile` scenario (19.7s wall, of which 9.7s was tracked in the profile):
- top hot node: Babel.compile 3543 ms (234 invocations)
- long tail: 2128 ms (22% of tracked time)
- 4 plugins detected (ecmascript, typescript, static-html, meteor) with tiny times. This app's source is small and isopacks dominate, so the plugin-compile workload is light. The metric infra works; bigger apps would stress it more.

Gates: 413/413 tests in ~845ms; bench.js list/compare clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@theme

…heme toggle Phase A of v2 redesign. Tailwind 4 is wired in via the @tailwindcss/cli watcher writing to client/main.css (Meteor's standard-minifier-css picks that up). PostCSS through the Meteor pipeline turned out flaky with rspack, so the CLI approach is what actually applies utilities. - _tw/main.tailwind.css is the source: imports tailwindcss, declares the class-based dark variant, maps Geist/JetBrains Mono into @theme, and scans both client/ and imports/ for utilities. - client/main.html drops the Bootstrap CDN, loads Geist + JetBrains Mono from Google Fonts, applies sans + dark canvas tokens on <body>. - client/main.js applies the saved theme class on Meteor.startup so there's no FOUC. - layouts/main.html is the new sidebar: brand block (Meteor Benchmark + v2.4.1-stable), Runs/Compare/Trends primary nav with an indigo active left-border, dimmed Settings/Documentation, ☀/☾ theme button + user pinned to the bottom. Mobile gets a thin top bar instead. - layouts/main.js owns the toggle: click → swap <html> class + localStorage('meteor-bench-theme') + a reactive var so the icon updates instantly. Pages still use Bootstrap markup → unstyled until Phase B–E rewrite them. Sidebar + dark canvas verified at http://localhost:4000/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase B. The Runs landing is just one dense table now — no auto-compare panel, no Versions strip; those were v1 over-design. Stitch design implemented closely. Markup: - "Runs" heading + "{n} runs · {n} scenarios" counter at right. - Filter row: scenario <select> + free-text tag search + a clear-✕ button that only renders when a filter is active. - Table: When/Version/Tag/Scenario/Wall/CPU/RAM/GC pause/Δ vs prev, + an ↗ column linking to detail. Tabular-nums via the `font-tabular` helper. Row hover tints neutral-50/900. - Empty state inlines the push command. Loading state too. - "Load more" widens the publication limit by 30 per click. JS: - `whenAgo` for the When column (s/m/h/d, then date). - `versionLabel` falls back to runtime.channel → "local" because most local result JSONs report `meteor.version: "system"`. - Per-row Δ is computed by walking each scenario chronologically and comparing the current run's wall_clock_ms to its predecessor in the same scenario. <5% = neutral grey, regression = orange-500, improvement = green-500. Threshold bands match the v1 compare logic so Runs ↔ Compare colors stay consistent. - statusBadge is gone — it was a fake metric (raw wall time → green badge regardless of scenario), which is misleading. Δ vs prev is the honest signal. Foundation tweaks needed along the way: - Body/<html> background + text color now live in _tw/main.tailwind.css with an html.dark override, because Meteor strips classes on <body>. - Theme toggle moved from #themeToggle (id) to .js-theme-toggle (class) so the mobile and desktop buttons can both wire up to one click handler without DOM-id collisions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase C. Compare is now a regression scoreboard sorted by absolute Δ% descending — biggest movers first, "within noise" last. Markup: - Top filter row: Scenario · Run A · ⇄ swap · Run B. Run selects show "{version} · {tag} · {scenario} · {when}" per option so you can identify a run at a glance. Picking a scenario narrows both A/B option lists and clears prior picks. - Headline strip: "{A} → {B}" mono label + summary counts colored per tier (regressions orange, improvements green, neutral grey) + a "hide within-noise" checkbox. - Scoreboard table: Metric · A · B · Δ abs · Δ % · Status pill · ↗A ↗B per row. Click a row to inline-expand a drilldown sub-table (per-method for ddp_methods, per-op for mongo_ops, etc.) — wired via ReactiveDict keyed on metric path. - Footer line lists metrics that only exist in one of the two runs ("only in A" / "only in B"). JS: - Single `M` array of metric extractors so adding a new comparable metric is one entry. drilldown() is optional and feeds the expand panel. - classify(delta, unit): hardcoded bands <5% neutral / <25% warn / ≥25% big. Sign-direction is unit-aware — for ms/mb/pct/bytes (higher = worse) positive Δ becomes regression; for count metrics it's "info" (no value judgement, just movement). - pctDelta + fmt handle nulls and unit-specific formatting. - Subscribes runs.recent(200), filters scenario-side client-side from the publication. No new pub needed. Bootstrap is now fully gone from this page — every class is Tailwind or one of our two helpers (font-tabular, font-mono via the Tailwind mono var). No Bootstrap collapse component either; the expand is just a {{#if expanded}} guard and a ReactiveDict toggle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
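The banding and unit-aware direction rules for classify(delta, unit) can be sketched as follows. The thresholds and the ms/mb/pct/bytes list come from the description above; the return shape ({ band, direction }), the 'unknown' case for null deltas, and the pctDelta guard for a zero baseline are assumptions, not the dashboard's actual code.

```javascript
// Units where a higher value is worse, per the description above.
const HIGHER_IS_WORSE = new Set(['ms', 'mb', 'pct', 'bytes']);

// Percent change from run A to run B; null when it cannot be computed.
function pctDelta(a, b) {
  if (a == null || b == null || a === 0) return null;
  return ((b - a) / a) * 100;
}

// Hypothetical classifier: <5% neutral, <25% warn, otherwise big.
function classify(deltaPct, unit) {
  if (deltaPct == null) return { band: 'unknown', direction: 'none' };
  const abs = Math.abs(deltaPct);
  const band = abs < 5 ? 'neutral' : abs < 25 ? 'warn' : 'big';
  if (band === 'neutral') return { band, direction: 'neutral' };
  // Count-like metrics carry no value judgement, only movement.
  if (!HIGHER_IS_WORSE.has(unit)) return { band, direction: 'info' };
  return { band, direction: deltaPct > 0 ? 'regression' : 'improvement' };
}
```

With these rules a +18% change in a wall-clock metric is a "warn" regression, while the same change in a count metric is reported as "info".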

Phase D. Single run drill-down redesigned per the Stitch reference: breadcrumb / header band / sticky left nav-pill rail / grouped sections. Markup: - Breadcrumb: Runs / {scenario} / {tag-as-mono}. - Header band: big mono {tag} as the h1, then a small inline row with the version pill + (optional) sha + scenario link + date + wall clock. Right side: Compare ▾ (indigo primary) + Prev run (ghost). - Verdict line under the header: "▲ +18% wall vs prev (release/3.4)" colored per band. Computed against the previous run of the same scenario. - Two-column body: sticky left rail with Overview/DDP/Mongo/Observer/ Build/Not in run anchor pills (each section is conditional and the pill is only emitted when the section has anything to show), and the right column with each section as an h2 + 2-up grid of metric cards. - "Not in this run" footer lists every absent metric family as muted mono pills, so analysts can tell apart "we know this was absent" vs "we just forgot to ship it". JS: - Kept every existing `hasXxx` / formatting helper logic from the prior detail.js — same `this.metrics.<key>` paths, same null-guarding via absence convention. Just adapted the return shape to feed a single `metricCard` partial (label/value pairs OR a tableHeaders + tableRows cells array). This keeps the markup tiny and consistent across the ~14 metric families. - Section flags (hasDdpSection / hasMongoSection / hasObserverSection / hasBuildSection / hasMissingSection) drive the rail + section visibility from one anyOf() check per group. - verdict() / prevRunId() walk the same-scenario predecessor for the Δ banner and the ← Prev run button. The metricCard partial is the only piece of shared infrastructure on the page — one component covers both "key/value table" and "real multi-column table" + optional header badge + footer note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Phase E. Trends: - One row of filters: Scenario · Metric (with <optgroup> by family — Runtime / DDP / Mongo / Build) · Segment by [version | tag] toggle · Date range (7/30/90/all) · right-aligned "n runs · m versions" counter that reflects the live filter. - One full-width Chart.js line chart in a card. Brand palette (indigo / green / orange / teal / pink), cycled per segment key so colors are stable across re-renders. - Vertical dotted lines mark each version first-appearance, with a small mono label at the top edge ("local →"). Implemented via an inline Chart.js plugin reading versionBoundaries. - Tabular numerals on both axes via JetBrains Mono. Grid + tick colors react to the dark class on <html>. - Point click → /run/:id of that run. - Custom legend rendered below the chart from the same chartStats ReactiveVar that powers the counter, so legend and counts can never drift. - canvas lives outside {{#if hasData}} (toggled via a `hidden` class) because Blaze removes the canvas when the if flips false, and the autorun otherwise can't find the canvas at the moment data arrives. Scenario: - Breadcrumb (Runs / {scenario}), big mono scenario title with run count + driver as a one-line caption, and primary Compare runs + ghost View trends buttons at the right. - Two cards in a 50/50 grid: About this scenario (prose) and At a glance (Driver badge / Virtual users / Duration / Requires browser). - Technical details collapsed accordion (▸/▾ chevron); body holds the long-form technical description with inline mono code spans. State via a ReactiveVar so we don't pull in Bootstrap collapse. - "Runs for this scenario" — same dense table shape as Runs landing but filtered by scenario, no Δ column (the scenario itself is the constant, so prev-run deltas live on Compare/Detail). - The static SCENARIOS dict was inlined from the prior file. Added a build-profile entry for completeness. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Detail

Local result JSONs ship meteor.version = "system" and meteor.sha = "unknown" when no Meteor source was wired in. Showing those as literal table values is noise. Now we just omit the row when the value is one of those sentinels, matching how the existing versionLabel() helper handles them in the header pill.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
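A minimal sketch of the sentinel handling described above, assuming a hypothetical `metaRows` helper and row shape (the real template helpers are not shown in this PR text); only the two sentinel strings come from the commit message.

```javascript
// Sentinel values named in the commit message.
const SENTINELS = new Set(['system', 'unknown']);

// Hypothetical helper: build the metadata rows, dropping any row whose value
// is missing or is one of the sentinels, so it never renders as a literal.
function metaRows(meteor) {
  return [
    { label: 'Version', value: meteor.version },
    { label: 'SHA', value: meteor.sha },
  ].filter((row) => row.value != null && !SENTINELS.has(row.value));
}
```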


Mongo accepts dots in object keys server-side, but minimongo rejects them on the client: a single dotted key (e.g. metrics.mongo_changestream indexed by `<db>.<collection>`) breaks the whole runs publication, so affected metrics silently vanish from the dashboard. Recursively replace '.' with '_' in keys at insert time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
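The key sanitization can be sketched as a small recursive function. This is an illustrative version, not the code in runs.js: the body is an assumption, and the `Date` guard reflects the follow-up fix mentioned in a later commit of this PR (recursing into a Date rebuilt the timestamp as `{}`).

```javascript
// Sketch only: recursively replace '.' with '_' in object keys before insert.
function sanitizeKeys(value) {
  if (Array.isArray(value)) return value.map(sanitizeKeys);
  // Primitives, null and Date instances are returned untouched. A Date has no
  // enumerable own keys, so rebuilding it key by key would yield an empty object.
  if (value === null || typeof value !== 'object' || value instanceof Date) {
    return value;
  }
  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key.replace(/\./g, '_')] = sanitizeKeys(inner);
  }
  return out;
}
```

Note that this mapping is lossy: `a.b` and `a_b` collapse to the same key, which is acceptable only if the original namespaces never differ by that single character.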

Metric cells render via {{{value}}} (raw HTML) so the <code>/<span> wrappers take effect. Run data (mongo namespaces, index/plugin/node names, tags, sha) is machine-generated but untrusted (pushed over DDP with the bench API key), so HTML-escape every interpolated value to prevent stored XSS. Also intercept sidebar `#anchor` clicks (FlowRouter swallows them) and scroll to the section manually.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
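Because the cells are emitted as raw HTML, each interpolated value has to be escaped before it is wrapped in markup. A minimal sketch of that pattern follows; `escapeHtml` and `monoCell` are hypothetical names, not necessarily the helpers used in the dashboard.

```javascript
// The five characters that matter for HTML text and attribute contexts.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Escape an untrusted value so it renders as text, never as markup.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Hypothetical cell formatter: the wrapper is trusted markup, the value is not.
function monoCell(value) {
  return `<code>${escapeHtml(value)}</code>`;
}
```

The key point is that escaping happens at the interpolation site, so the trusted `<code>` wrapper survives while any markup inside run data is neutralized.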


…ear + scenario meta

- runs.js: stop sanitizeKeys from recursing into Date (it rebuilt the timestamp as {} → "Invalid Date"); add authenticated runs.clear method
- dashboard.html/js: remove the "Δ vs prev" column + its dead delta calc
- scenario.js: add ddp-reactive-extended metadata (fixes "Unknown scenario")
- main.html: version label v2.4.1-stable → v0.1.0-beta

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- bench.js/cli: new `bench.js clear --confirm` dashboard command (mirrors push/baseline over DDP) to wipe all runs via the runs.clear method
- bench.config.js: register ddp-reactive-5min and ddp-reactive-extended
- artillery/ddp-reactive-extended.yml: ~7-min sustained DDP load profile (2→5→10 VU/s) sized for a capable machine, for the observer-driver × transport benchmark matrix
- artillery/ddp-reactive-5min.yml: track the existing 5-min profile

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dupontbertrand and others added 30 commits March 25, 2026 19:02

Fix: fetch release branches in CI clone (--no-single-branch + explicit fetch)

e66e3fe

Fix: show comparison report in CI logs before exit

a54244e

Fix: ensure results/ dir exists for GC monitor, add debug logging

74fe9b3

Set dashboard Meteor release to 3.4 for Galaxy deploy

d29efbf

Security: add deny rules on collections, remove personal path from .envrc

b6d77d3

Fix default baseline branch name: release-3.5 (not release/3.5)

f962c22

Fix nightly workflow: use release branches, light scenario only, fix compare

b1ac2d2

Update README with benchmark framework documentation

de704d4

Fix Galaxy cold start: wake up dashboard before pushing results

297ef61

Add curl + sleep before DDP push to wake Galaxy free tier from cold start. Add timeout-minutes: 2 on push steps to avoid 30min CI hangs.

Fix cold start: retry loop up to 5 min instead of sleep 15

75c95d7

Update dashboard URL to new Galaxy sandbox app

c9a9283

Fix: trim BENCH_DASHBOARD_URL env var

90afaf3

Implement cold-start and bundle-size CLI scenarios

8b57186

cold-start: runs meteor reset + meteor run N times, reports median/min/max startup time
bundle-size: runs meteor build --directory, reports client JS, server, total bundle size + build time
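The median/min/max summary reported by cold-start can be illustrated with a small aggregation function. This is a sketch under assumptions: the name `summarize` and the input shape (an array of startup times in milliseconds) are hypothetical, and the repository's aggregator may differ.

```javascript
// Hypothetical aggregator: summarize N startup samples (in ms).
function summarize(samplesMs) {
  if (samplesMs.length === 0) return null; // no runs, nothing to report
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: middle element. Even count: mean of the two middle elements.
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { median, min: sorted[0], max: sorted[sorted.length - 1] };
}
```

Reporting the median rather than the mean keeps one slow outlier (for example a cold disk cache on the first iteration) from skewing the headline number.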

Add --env flag to pass environment variables to Meteor process

e482323

Usage: node bench.js run --scenario ddp-reactive-light --env DDP_TRANSPORT=uws
Supports multiple: --env DDP_TRANSPORT=uws --env MONGO_OPLOG_URL=...
Injected into all Meteor spawns (run, script, cold-start, bundle-size)
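Collecting repeated `--env KEY=VALUE` flags could look like the sketch below. The function name `parseEnvFlags` and the error message are assumptions for illustration; the real argument parsing in bench.js is not shown in this PR text.

```javascript
// Hypothetical parser: gather every "--env KEY=VALUE" pair from argv.
function parseEnvFlags(argv) {
  const env = {};
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] !== '--env') continue;
    const pair = argv[++i]; // the token following --env
    const eq = pair === undefined ? -1 : pair.indexOf('=');
    if (eq <= 0) throw new Error(`--env expects KEY=VALUE, got: ${pair}`);
    // Split on the first '=' only, so values such as URLs may contain '='.
    env[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return env;
}
```

The result can then be merged over the parent environment when spawning Meteor, for example with Node's `child_process.spawn` and `{ env: { ...process.env, ...parseEnvFlags(process.argv.slice(2)) } }`.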

Add transport benchmark workflow: sockjs vs uws matrix

d2d5448

Runs the same scenario on the same branch with DDP_TRANSPORT=sockjs and DDP_TRANSPORT=uws in parallel. Pushes both results to dashboard for comparison.

italojs and others added 30 commits June 2, 2026 13:37

chore(dashboard): rename "Wall" column to "Duration"

11cc888

User-facing labels only — JSON contract (wall_clock_ms), dropdown option values, and internal metric keys are unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chore(dashboard): add Galaxy deploy config + ignore .galaxy/

73c504b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PATCH-PROF: respect existing SERVER_NODE_OPTIONS (e.g. --prof passed via shell). Concat instead of overwrite.

9bf0402

italojs wants to merge 80 commits into meteor:main from italojs:redesign/v2-blaze-tailwind


