Refactor: from ad-hoc benchmark scripts to a Meteor Benchmark Platform #22
Open
italojs wants to merge 80 commits into
- bench.js CLI: run, compare, push, baseline, list commands
- Collectors: CPU/RAM (pidusage), GC (perf_hooks), event loop delay
- Regression detector with configurable thresholds and markdown output
- Blaze dashboard app (Meteor 3, Bootstrap 5, Chart.js)
- Pages: Dashboard, Compare, Trends, Run Detail
- DDP methods for pushing results from CLI
- Artillery light scenario (30 VUs) for quick CI runs
- GitHub Actions workflows: PR benchmark + nightly
Two new scenarios using SimpleDDP + ws (no Playwright/Chromium):
- ddp-reactive-light: subscribe + CRUD (150 VUs, 30s)
- ddp-non-reactive-light: methods-only CRUD (150 VUs, 30s)
Isolates server/DDP performance from browser rendering overhead.
Add curl + sleep before DDP push to wake Galaxy free tier from cold start. Add timeout-minutes: 2 on push steps to avoid 30min CI hangs.
Clicking a scenario name in dashboard or detail view opens a page with:
- Simple description (what does this scenario do?)
- At-a-glance table (driver, VUs, duration, browser required)
- Technical details (DDP flow, oplog, collectors)
- Recent runs for this scenario
cold-start: runs meteor reset + meteor run N times, reports median/min/max startup time
bundle-size: runs meteor build --directory, reports client JS, server, total bundle size + build time
New driver type 'script' for standalone Node.js benchmarks. fanout-bench.js: connects N subscribers, 1 writer does inserts, measures time for all subscribers to receive the reactive update. Reports p50/p95/p99/avg/max fanout latency + CPU/RAM/GC.
Usage: node bench.js run --scenario ddp-reactive-light --env DDP_TRANSPORT=uws
Supports multiple: --env DDP_TRANSPORT=uws --env MONGO_OPLOG_URL=...
Injected into all Meteor spawns (run, script, cold-start, bundle-size)
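The repeatable `--env KEY=VALUE` flag could be folded into a spawn environment roughly as follows. This is a minimal sketch: the helper name `parseEnvPairs`, its error wording, and the example Mongo URL are hypothetical and not taken from the harness.

```javascript
// Hypothetical helper: turns ['A=1', 'B=x=y'] into { A: '1', B: 'x=y' }.
function parseEnvPairs(pairs) {
  const env = {};
  for (const pair of pairs) {
    const eq = pair.indexOf('=');
    if (eq <= 0) {
      throw new Error(`--env expects KEY=VALUE, got "${pair}"`);
    }
    // Split on the first '=' only, so values such as connection URLs may contain '='.
    env[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return env;
}

const extraEnv = parseEnvPairs([
  'DDP_TRANSPORT=uws',
  'MONGO_OPLOG_URL=mongodb://127.0.0.1:3001/local?replicaSet=rs0', // example value only
]);

// Layered over the parent environment before each Meteor spawn.
const spawnEnv = { ...process.env, ...extraEnv };
console.log(spawnEnv.DDP_TRANSPORT);
```

Splitting on the first `=` only is a deliberate choice here, since URL values commonly contain further `=` characters.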
Runs the same scenario on the same branch with DDP_TRANSPORT=sockjs and DDP_TRANSPORT=uws in parallel. Pushes both results to dashboard for comparison.
Aligns the harness's declared runtime with the version we want to target across the refactor. Workflows previously pinned Node 22; volta and package.json engines declared 20.x. After this commit everything agrees on >=24.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Meteor 2.x is end-of-life and no CI workflow references the tasks-2.x benchmark app. Removing it shrinks the harness scope to the supported Meteor 3 surface: fewer apps to keep building, fewer env vars to document, one less map entry to wire scenarios against. The bench.config.js apps map, README structure tree, and RUNTIME deploy docs now reflect tasks-3.x as the sole benchmark target.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Meteor auto-discovers <app>/packages/, so the top-level packages/ directory needed a METEOR_PACKAGE_DIRS injection at every meteor spawn to be picked up. Moving tasks-common under apps/tasks-3.x/packages/ removes that ceremony: the package now lives where Meteor expects it and bench.js drops four redundant env-var assignments. apm-agent only served two orphaned shell scripts (monitor.sh, deploy.sh) and was user-approved for deletion; SCRIPTS.md (commit 13) will flag the caveat for anyone who needs to re-add it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The otel/ directory held a Grafana Tempo + collector stack that no script, workflow, or harness module references (confirmed by grep across the repo). Removing it from the working tree (it was never git-tracked) drops the unused infra spec, and the RUNTIME.md Deploy section sheds its MontiAPM/ENABLE_APM narrative since the public harness path (bench.js) doesn't toggle APM. ENABLE_APM remains live in the orphan scripts/monitor.sh and scripts/deploy.sh; commit 13's SCRIPTS.md will document those as legacy ops helpers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workflow accepted repository_dispatch payloads and interpolated client_payload.* values directly into `run:` shell blocks (and into a github-script body), which is the GitHub Actions command-injection pattern documented in
https://github.blog/security/vulnerability-research/how-to-catch-github-actions-workflow-injections-before-attackers-do/
A caller able to dispatch the `benchmark-pr` event could inject shell commands via fields like client_payload.branch.
Fix: every value that flows from event payload or inputs into a shell or script body now goes through the step's `env:` block first, and the body references the value via $VAR (shell) or process.env.VAR (github-script). Downstream steps that consumed steps.params.outputs.* in shell get the same treatment. Behavior is unchanged.
Nightly and Transport workflows trigger only on schedule/workflow_dispatch (no external payload), so they are out of scope for this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Node 24 ships with first-class ESM and node:util.parseArgs, removing the last reasons to keep the harness on CJS + minimist. Adding "type": "module" lets every harness module use top-level static imports, dropping the dual require()/dynamic-require pattern in bench.js (simpleddp/ws were lazy-required just to delay the dependency cost). parseArgs gives us a typed schema for every flag the CLI accepts, including the repeatable --env KEY=VALUE (multiple: true). The CLI contract is byte-identical: same flag names, same exit codes, same help text.
gc-monitor stays CJS as gc-monitor.cjs since Node's --require loader is CJS-only; the path reference in bench.js is updated and an explicit comment at the top of the file flags this so no one converts it on a future pass. The dead regression-detector require.main === module CLI shim is removed; bench.js compare is the sole public entrypoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires npm test to node:test and covers the pure paths of the harness
that exist today: splitEnvArgs, regression-detector (compare +
toMarkdown), buildResult shape/keying/fallback, and writeResult +
appendToHistory using os.tmpdir() for fs assertions. Six fixtures under
tests/unit/fixtures/ — baseline plus targets for pass / regression /
improvement / zero-baseline / non-finite scenarios — match the
collector-produced JSON schema the dashboard reads.
regression-detector tests for zero-baseline, null target, NaN, and
Infinity pin the CURRENT (buggy) silent-skip behavior with explicit
"TODO commit 11" comments, so the diff at commit 11 is unambiguous.
The buildResult tests skip the meteorCheckoutPath git-shelling branch
because mock.method against the node:child_process namespace fails with
"Cannot redefine property: execSync" — that branch is being deleted in
commit 7 (buildResult becomes pure, taking meteor: {version, sha} as
input), and commit 7 will add coverage for the pure signature directly.
48 tests pass in ~90ms; well under the 5s npm test budget. No Meteor,
no network. node bench.js list and node bench.js compare against the
new fixtures both work; compare exits 0 on the passing-target fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Meteor introspection now has a single source of truth. The previous
duplication — git-shelling once inside reporters/json-reporter.js
buildResult and once inside bench.js getMeteorInfo — collapses into
resolveMeteorSource({flags, env, config}) in meteor-source.js, called
once per command and threaded through as a source object. buildResult
becomes pure: it takes meteor: {version, sha} as input and never shells out.
The two try/catches that silently returned 'unknown' on git failure
are gone — getMeteorInfo now throws an actionable error naming the
checkout path and asking "Is this a git checkout?" so misconfigured
runs fail loud instead of producing JSON with sha='unknown' that hides
the real problem. The function signature already accepts the
meteor-version / METEOR_RELEASE / config.meteorVersion inputs commit 8
will wire in, so commit 8 only adds the release-mode branch. Git
shelling uses execFileSync('git', [...]) instead of execSync(cmd) —
no shell, no possibility of injection from the checkout path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets users benchmark a published Meteor release without checking out
the source — useful for reproducible CI runs and comparing released
versions. resolveMeteorSource extends commit 7's checkout/system
branches with a release branch that pre-bakes meteor.version=<version>
and meteor.sha='release:<version>'; the 'release:' prefix keeps the
existing dashboard JSON contract (non-empty strings) while making the
mode visible at a glance. Inputs come from flag > env > config:
--meteor-version, METEOR_RELEASE, config.meteorVersion. The mode is
mutually exclusive with checkout — both set with usable values throws
an error naming the conflicting strings ("got version=X and
checkout=Y. Pick one."). The exclusion check requires the checkout
binary to actually exist, so a stale config.meteorCheckoutPath default
doesn't block --meteor-version. Every meteor spawn site picks up
source.releaseArg via a small meteorArgv/meteorShellPrefix helper —
9 sites, 0 duplication. README's new "Meteor source" section
documents both modes side-by-side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
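The "flag > env > config" precedence and the checkout/release exclusion described in this commit can be sketched as a pure function. Everything below is an illustrative assumption rather than the real `resolveMeteorSource`: the function name `pickMeteorSource`, the injected `checkoutExists` probe, the returned `mode` field, and the array shape given to `releaseArg` are all hypothetical.

```javascript
// Hypothetical sketch of "flag > env > config" resolution with mutual exclusion.
function pickMeteorSource({ flags = {}, env = {}, config = {}, checkoutExists = () => false } = {}) {
  // First defined value wins: CLI flag, then environment, then config file.
  const version = flags['meteor-version'] ?? env.METEOR_RELEASE ?? config.meteorVersion ?? null;
  const checkoutPath = config.meteorCheckoutPath ?? null;
  // A stale default path must not block release mode, so the checkout has to exist.
  const checkoutUsable = checkoutPath !== null && checkoutExists(checkoutPath);

  if (version && checkoutUsable) {
    throw new Error(`got version=${version} and checkout=${checkoutPath}. Pick one.`);
  }
  if (version) {
    return {
      mode: 'release',
      meteor: { version, sha: `release:${version}` },
      releaseArg: ['--release', version], // assumed shape, for illustration
    };
  }
  if (checkoutUsable) return { mode: 'checkout', checkoutPath };
  return { mode: 'system' };
}

const source = pickMeteorSource({
  flags: { 'meteor-version': '3.1' }, // example version string
  env: { METEOR_RELEASE: '3.0' },
  config: { meteorCheckoutPath: '/path/that/does/not/exist' },
});
console.log(source.meteor.sha);
```

Injecting `checkoutExists` keeps the function free of filesystem access, which matches the commit's preference for pure, unit-testable helpers.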
Eliminates the 80-line app-lifecycle duplication between cmdRun and
cmdScript in bench.js by moving install-deps, reset, start, wait-for,
collectors, and stop into runner/. waitForApp swaps the legacy
execSync('curl ...') + execSync('sleep 1') polling loop for native
fetch + node:timers/promises, with a clear actionable error on
timeout. findPid keeps its one tight try/catch — pgrep exits 1 on no
match, that's documented as an expected absence so callers can treat
null as "skip this collector".
The runner sits behind runner/_io.js — a plain-object I/O facade that
re-binds node:child_process / node:fs / node:timers/promises functions
and a fetch wrapper as configurable properties. ESM namespace exports
(including re-exports from node:*) are non-configurable, so tests
hitting mock.method against them throw "Cannot redefine property"; the
plain object dodges that without DI scaffolding or {exec, fs, spawn}
function params. Production reads `io.execSync(...)` instead of
`execSync(...)` — a three-character prefix in exchange for a fully
mockable boundary.
Also closes the latent shell-injection surface QA flagged on commit 8:
meteor reset / meteor run / npm install all switch from
execSync(`${meteorCmd} subcommand`) template-literal shell calls to
io.execFileSync(meteorCmd, argv) / io.spawn(meteorCmd, argv). No
shell, no parsing of source.releaseArg or app paths as shell input,
no possibility of injection from user-controlled inputs even if a
future caller passes hostile data. cmdBundleSize's `du -sk` and
`rm -rf` shell-outs remain, spec'd for commit 10's
drivers/bundle-size.js conversion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
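The facade idea and the fetch-based polling loop from this commit can be illustrated together. The sketch below is not the contents of runner/_io.js: the two-key `io` object, the option names, and the error wording are assumptions, and it uses a `setTimeout` wrapper instead of node:timers/promises only to stay import-free.

```javascript
// Plain-object I/O facade: its properties are writable, so tests can swap them.
const io = {
  fetch: (url, init) => fetch(url, init),
  sleep: (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
};

// Polls until the app answers with a 2xx status, or fails with an actionable error.
async function waitForApp(url, { timeoutMs = 60_000, intervalMs = 1_000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await io.fetch(url);
      if (res.ok) return;
    } catch {
      // Connection refused while the app is still booting; retry after the sleep.
    }
    await io.sleep(intervalMs);
  }
  throw new Error(
    `App at ${url} did not respond within ${timeoutMs}ms. Is the port correct and did the app start?`,
  );
}
```

In a test, replacing `io.fetch` and `io.sleep` with stubs is enough to exercise the retry path without a network or real delays, which is the property the commit attributes to the plain-object boundary.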
bench.js shrinks from 547 lines to 93 — now just parseArgs schema,
command switch, and inline help. Every command body lives in a
single-purpose cli/ module that calls into a driver and persists the
result. Each driver owns one scenario kind (artillery, script, cold-
start, bundle-size) and returns a buildResult-shaped object so the
runtime contract stays uniform across all 10 scenarios.
drivers/bundle-size.js drops the last shell-outs: `du -sk` becomes a
recursive io.statSync sum, `rm -rf` becomes io.rmSync({recursive:true,
force:true}), and `meteor build` becomes io.execFileSync(argv) like
the other meteor calls in commit 9. Zero template-literal shell calls
remain in the new code; bench.js no longer imports execSync at all.
runner/_io.js extends to 15 keys: statSync + rmSync for bundle-size,
SimpleDDP + ws for cli/dashboard.js. Per REFACTOR_SPEC.md hard-
constraint meteor#4's approved exception, this is the canonical io facade
for the whole codebase — drivers/ and cli/ reuse runner/_io.js rather
than forking it. drivers/index.js mirrors the same plain-object
dispatch pattern so cli/run.js can pick a driver via switch and tests
can mock.method individual drivers.
cli/compare.js gains an actionable error path: missing or unparseable
result files now exit 1 with the file path and a next-step hint
("Check the path or run 'bench.js run' first to produce it." /
"Is the file a valid bench.js result?") instead of an unhandled
exception. cli/run.js's unknown-scenario / unknown-app errors list
the valid options rather than just naming what was wrong. Both are a
small commit-12 preview that lands here because tests need to assert
against them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measures server-internal write-to-emit latency: time from
Collection.insertAsync resolving to the moment Meteor's observer
emits the corresponding `added` DDP message for EACH subscribed
client. Surfaces as `metrics.live_update_propagation` (flat
aggregate — not per-pub, not per-doc). This is THE metric that
attributes wall-clock differences across observer drivers
(changeStreams vs oplog vs polling) to actual propagation paths
instead of CPU/RAM symptoms.
Hooks (both prototype-level — Mongo.Collection and Session classes
patched once so every instance and every future Collection picks
the wrap up automatically):
1. Mongo.Collection.prototype.insertAsync → post-await,
map.set(docId, Date.now()).
2. Session.prototype.sendAdded + sendChanged → if docId in map
AND elapsed <= ATTRIBUTION_TTL_MS (10s), record sample.
Why server-side Map keyed by docId, NOT the spec's __benchPushedAt
in-doc field:
- In-doc field pollutes Mongo schema permanently.
- Initial-batch contamination: when a new sub connects, the
observer's initial fetch fires sendAdded for ALL existing docs.
Stale __benchPushedAt values would record ancient timestamps as
"propagation". The Map + 10s TTL filters this out cleanly.
This follows the spirit of REVISIONS.md task 03, but the implementation
diverges from the prose: REVISIONS suggested __benchPushedAt, and the
initial-batch gotcha was not caught in review.
Gated entirely on PROPAGATION_TIMING_OUTPUT — without the env var,
init is a complete no-op and Mongo writes are never wrapped (zero
overhead in dev). Same gate applied retroactively to method-timing
and sub-timing for consistency: was always-on wrap with env-gated
file output; now env-gated wrap + file output.
Rule-of-three refactor:
- apps/tasks-3.x/packages/bench-monitors/_dump-on-shutdown.js —
extracted shared SIGTERM/SIGINT/beforeExit dump-once helper
(~25 lines). Replaces 3× inline copies across method/sub/
propagation timing.
Plumbing mirrors tasks 01 + 02:
- bench-monitors package: new propagation-timing.server.js +
re-export from bench-monitors.server.js.
- server/main.js: initPropagationTiming() before registerTaskApi.
- runner/meteor-process.js: PROPAGATION_TIMING_OUTPUT env passthrough.
- runner/collectors.js: preparePropagationTimingOutput +
aggregatePropagationTiming (flat-array variant) + read-on-stop
with absence guard.
- drivers/{artillery,script}.js: call prepare + pass path through
start/stopCollectors.
Tests:
- tests/unit/propagation-timing-percentiles.test.js — 11 cases
covering empty/null, single sample, 1000-sample percentiles,
BARE-percentile contract, _ms-suffix contract, flat-shape lock
(no per-pub/per-doc keys), zero-valued samples kept.
- tests/unit/metric-keys-contract.test.js — extends ALLOWED set
by one line (live_update_propagation).
bench-monitors README:
- Added propagation row to the monitors table.
- "Known limitations" section: updates not wrapped (insertions
only), TTL trade-off, polling driver dominated by interval.
- "How a monitor works" rewritten around the env-gate-first
pattern + reference to _dump-on-shutdown.js helper.
- File-layout tree updated.
Gates: 227/227 tests in ~727ms (+11 new); bench.js list/compare clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
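The Map-plus-TTL attribution described in this commit can be sketched without any Meteor dependency. The names `createPropagationTracker`, `recordInsert`, and `recordEmit` are hypothetical; in the real monitor the equivalent logic is said to live inside the insertAsync and sendAdded/sendChanged wraps.

```javascript
const ATTRIBUTION_TTL_MS = 10_000;

// Hypothetical tracker; `now` is injectable so the logic can be tested without timers.
function createPropagationTracker(now = Date.now) {
  const insertedAt = new Map(); // docId -> timestamp taken after the insert resolved
  const samples = [];           // flat list of propagation latencies in ms

  return {
    // Would be called from the insertAsync wrap, after the awaited insert returns.
    recordInsert(docId) {
      insertedAt.set(docId, now());
    },
    // Would be called from the sendAdded / sendChanged wraps, once per session.
    recordEmit(docId) {
      const start = insertedAt.get(docId);
      if (start === undefined) return; // doc from an initial batch: not attributed
      const elapsed = now() - start;
      if (elapsed <= ATTRIBUTION_TTL_MS) samples.push(elapsed);
    },
    samples,
  };
}
```

Documents that were never inserted through the wrap have no Map entry, so an initial-batch `added` for an old document records nothing. This sketch never evicts Map entries; how the real monitor bounds that Map is not stated in the commit.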
Two bugs surfaced during E2E validation of the Phase A metrics (commit ec7c1fa shipped 3 monitors but none of them actually wrote their dump files in a real benchmark run).
1) installDumpOnShutdown relied solely on SIGTERM/SIGINT/beforeExit handlers to flush samples. The meteor parent process kills its node child without reliably forwarding SIGTERM (collectors/gc-monitor.cjs calls this out at line 88, which is why gc-monitor uses a periodic snapshot). Our helper inherited the buggy pattern and silently lost ALL data: no file ever appeared. Result JSON had no ddp_methods / ddp_subscriptions / live_update_propagation keys despite the in-app collectors recording samples correctly.
Fix: write the output file every 5s via setInterval (unref'd so it doesn't block exit). Signal handlers stay as best-effort capture of the last 0-5s. Same shape gc-monitor.cjs uses.
2) propagation-timing.server.js: REVISIONS.md task 03 had wrong code. `Meteor.onConnection(conn => { const session = conn._session; ... })` always saw `conn._session === undefined`. The connection object in Meteor 3.x doesn't expose `_session`; the actual Session lives in `Meteor.server.sessions` (Map keyed by id) and is created AFTER the DDP `connect` message arrives (post-onConnection). Result: prototype never patched, sendAdded never recorded, samples array empty.
Fix: rewrote `tryPatchSessionProto()` to walk `Meteor.server.sessions` and grab the prototype off any live session (all share it). It is called lazily from onConnection (deferred via setImmediate so Meteor has a tick to create the session) and from inside the insertAsync wrap (covers scenarios where sub→insert ordering races against session creation), and it is idempotent on re-entry. Handles both Map and plain-object shapes of Meteor.server.sessions for forward/backward compat.
E2E validation on ddp-reactive-light (150 VUs, 30s):
- ddp_methods: 6150 calls, 3 methods, p99 insertTask=0.93ms
- ddp_subscriptions: 150 subs, p99 fetchTasks=3.05ms
- live_update_propagation: 8177 observed updates, p50=1ms, p95=44ms, p99=52ms (observer=oplog)
Result JSON now contains all three new metric keys end-to-end. Bench-monitors README updated to reflect the periodic-snapshot-first flush pattern. Unit tests unchanged: 227/227 in 635ms.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
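The lazy, idempotent prototype lookup from fix 2 can be sketched against a stand-in for `Meteor.server.sessions`. This is an assumption-laden illustration, not the shipped `tryPatchSessionProto()`: the signature taking `sessions` and a callback is hypothetical, and the sketch assumes the document id is the second argument of `sendAdded`.

```javascript
// Returns the prototype shared by live sessions, or null if none exist yet.
// `sessions` may be a Map (keyed by id) or a plain object.
function findSessionProto(sessions) {
  if (!sessions) return null;
  const values = sessions instanceof Map ? sessions.values() : Object.values(sessions);
  for (const session of values) {
    if (session) return Object.getPrototypeOf(session);
  }
  return null;
}

let patched = false;

// Idempotent: safe to call from several places until a session finally exists.
function tryPatchSessionProto(sessions, onAdded) {
  if (patched) return true;
  const proto = findSessionProto(sessions);
  if (!proto || typeof proto.sendAdded !== 'function') return false;
  const original = proto.sendAdded;
  proto.sendAdded = function (...args) {
    onAdded(args[1]); // assumption: (collectionName, id, fields) argument order
    return original.apply(this, args);
  };
  patched = true;
  return true;
}
```

Because the wrap is installed on the shared prototype, sessions created after the patch pick it up as well, and repeated calls return early instead of wrapping twice.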
Mongo opcounters — per-second rates for insert / query / update /
delete / getmore / command, computed from serverStatus().opcounters
delta over the benchmark window. Answers "different work or same work
done differently?" when comparing observer drivers. Surfaces as
`metrics.mongo_ops`:
{
metric: 'mongo_ops',
duration_s: 35.27,
ops_per_sec: { insert: 85, query: 176, delete: 89, getmore: 156, ... },
totals: { insert: 3000, query: 6213, delete: 3150, getmore: 5526, ... }
}
Implementation (CC-6: mongodb npm driver, not the mongo shell, which
the Mongo 7.0 dev_bundle dropped):
- Added `mongodb` as harness production dep.
- Exposed `MongoClient` on runner/_io.js (testable via mock.method).
- collectors/mongo-ops-monitor.js — standalone ESM script, spawned by
the harness alongside process-monitor. Connects to target Mongo,
reads serverStatus().opcounters at startup baseline and again on
SIGTERM, dumps JSON to stdout. Same drain shape as process-monitor.
- runner/mongo-ops-rates.js — pure rate-math, extracted so it's
unit-testable without spawn / MongoClient.
- runner/collectors.js — startCollectors gained `mongoUri` param,
spawns mongo-ops-monitor when set.
- drivers/{artillery,script}.js — derive URI as
`mongodb://127.0.0.1:${appPort + 1}` (Meteor's local Mongo port),
overridable via `BENCH_MONGO_URL` for external Mongo (Galaxy etc.).
- Collector skips silently when Mongo isn't reachable (logs to stderr,
exits 0 with no stdout → stopCollectors omits the key per absence
convention CC-5).
Tests:
- tests/unit/mongo-ops-rates.test.js — 12 cases covering empty/zero
activity, counter reset (end<start → treat delta=end), divide-by-
zero on sub/zero/negative durations, new opcounter keys in future
Mongo versions, null start, numeric coercion.
- metric-keys-contract.test.js — extends ALLOWED set by one line.
E2E validation on ddp-reactive-light (150 VUs, 33s):
- inserts: 3000 total / 85 ops/sec (exact match — 150 VUs × 20 each)
- deletes: 3150 total / 89 ops/sec (3000 removeTask + 150 removeAllTasks)
- queries: 6213 / 176 ops/sec (observer reads, change-stream lookups)
- getmore: 5526 / 156 ops/sec (change-stream cursor follow-ups)
- oplog driver run; numbers will differ across changeStreams/polling.
Dashboard panels (apps/dashboard/imports/ui/pages/detail.{html,js}):
- DDP Methods — per-method count + avg/p95/p99/max (sorted by count)
- DDP Subscriptions — per-publication count + avg/p95/p99/max
- Live-update propagation — observed_updates + avg/p50/p95/p99/max
- Mongo opcounters — per-op total + ops/sec (in serverStatus order)
All panels guarded with `hasXxx` helpers — absent metrics → omitted
cards (CC-5). Retro-fits the visualization gap for tasks 01-04.
Gates: 239/239 tests in ~750ms (+12 new); bench.js list/compare clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
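The opcounters delta math described in this commit can be sketched as a small pure function. The name `computeOpRates`, the rounding to whole ops/sec, and returning 0 for non-positive durations are assumptions for illustration; the start snapshot values in the usage lines are invented so that the deltas match the totals quoted in the commit.

```javascript
// Hypothetical rate math over two serverStatus().opcounters snapshots.
function computeOpRates(start, end, durationSeconds) {
  const totals = {};
  const opsPerSec = {};
  // Iterating the end snapshot means opcounter keys added by newer servers are kept.
  for (const [op, endRaw] of Object.entries(end ?? {})) {
    const endCount = Number(endRaw) || 0;
    const startCount = Number(start?.[op]) || 0;
    // A counter that went backwards implies a reset: treat the delta as the end value.
    const delta = endCount < startCount ? endCount : endCount - startCount;
    totals[op] = delta;
    // Assumption: a zero or negative window yields a rate of 0 rather than Infinity/NaN.
    opsPerSec[op] = durationSeconds > 0 ? Math.round(delta / durationSeconds) : 0;
  }
  return { metric: 'mongo_ops', duration_s: durationSeconds, ops_per_sec: opsPerSec, totals };
}

const opsResult = computeOpRates(
  { insert: 120, query: 40 },     // invented baseline
  { insert: 3120, query: 6253 },  // baseline + the totals quoted above
  35.27,
);
console.log(opsResult.ops_per_sec); // { insert: 85, query: 176 }
```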
Offline trend tool, NOT a new metric collector: it spawns nothing and adds no metrics.<key>. It reads the already-saved result history under config.results.history, keeps the runs with scenario === "bundle-size", sorts them by timestamp, and prints a Δ table so an operator can spot bundle bloat across runs without grep/jq dances.
Flags: --limit N (recent runs, default 5), --format markdown|json (default markdown), --warn-kb N (⚠️ threshold on a positive delta, default 50). Markdown is the default human view (header + per-row delta, "-" for the first row, "+N KB⚠️ " once a jump hits the threshold); JSON emits { trend: [{ tag, client_js_kb, server_kb, total_kb, delta_kb }] } with delta_kb null on the first row for piping to jq.
Forward-compatible with old/new result-JSON shapes: it only reads metrics.bundle_size (guarding on a numeric total_kb) plus the top-level scenario/tag/timestamp, so any other field a past or future harness version writes is ignored rather than fatal. A malformed or non-matching history file is skipped, never sinks the trend. Empty history prints a friendly "no runs found" line and exits 0.
Split into pure helpers (loadBundleRuns/computeTrend/formatMarkdown/formatJson) with the io facade injected into the loader, so the 16 new unit tests mock readdirSync/readFileSync instead of touching real disk.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
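The delta column and warning threshold can be sketched as follows. The helper names `trendRows` and `deltaCell` are hypothetical stand-ins (the commit's own helpers are not shown), the input is assumed to be already filtered and sorted, and the exact cell formatting is a guess based on the description.

```javascript
// Hypothetical delta computation over runs already sorted by timestamp.
function trendRows(runs, warnKb = 50) {
  return runs.map((run, i) => {
    const prev = runs[i - 1];
    // First row has nothing to compare against, so its delta is null.
    const deltaKb = prev ? run.total_kb - prev.total_kb : null;
    return {
      tag: run.tag,
      total_kb: run.total_kb,
      delta_kb: deltaKb,
      warn: deltaKb !== null && deltaKb >= warnKb, // only positive jumps can reach the threshold
    };
  });
}

// Renders one markdown delta cell: "-" for the first row, signed KB otherwise.
function deltaCell(row) {
  if (row.delta_kb === null) return '-';
  const sign = row.delta_kb > 0 ? '+' : '';
  return `${sign}${row.delta_kb} KB${row.warn ? ' ⚠️' : ''}`;
}

const trend = trendRows([
  { tag: 'run-a', total_kb: 1000 }, // invented sample data
  { tag: 'run-b', total_kb: 1060 },
  { tag: 'run-c', total_kb: 1040 },
]);
console.log(trend.map(deltaCell).join(' | '));
```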
Tasks 05 and 07 implemented in parallel via worktree agents. Due to a
harness worktree-collision bug all three task-05/06/07 agents landed in
the SAME worktree; agent 06 (the cleanest scope) committed alone, while
agents 05 and 07 left their code uncommitted in that shared tree. Their
work was complete and tested (278/278 in shared worktree) so this
commit consolidates both metric collectors here in main, fills three
small gaps the agents missed, and adds the dashboard panels.
Task 05 — observer_pool:
- apps/tasks-3.x/packages/bench-monitors/observer-pool-sampler.server.js
Env-gated (OBSERVER_POOL_OUTPUT). setInterval (1000ms) reads
MongoInternals.defaultRemoteCollectionDriver().mongo._observeMultiplexers
(server-side path per REVISIONS — spec's Meteor.connection path is
client-only). 10_000 sample cap.
- runner/observer-pool-aggregator.js — pure {min,max,avg,end} math
for both multiplexer_count and handle_count. Null on empty samples.
- tests/unit/observer-pool-aggregator.test.js — 8 cases.
Task 07 — ddp_messages:
- apps/tasks-3.x/packages/bench-monitors/ddp-message-counter.server.js
Env-gated (DDP_MESSAGE_OUTPUT). Hooks Session.prototype.send
(outgoing) and Meteor.onMessage (incoming) per REVISIONS — the
spec's Meteor.server._stream_server field doesn't exist. Reuses the
Meteor.server.sessions lazy proto-lookup pattern from
propagation-timing.server.js (conn._session is undefined in
Meteor 3.x onConnection, so we walk Meteor.server.sessions after
setImmediate). High message volume → CC-8 SIGTERM dump-file.
- runner/message-rate-aggregator.js — pure totalIn/totalOut +
in_per_sec/out_per_sec + by_type. Null on zero-zero totals.
- tests/unit/message-rate-aggregator.test.js — 10 cases.
Gaps filled (not done by agents — they ran inside a shared worktree
and couldn't validate the full plumbing):
- drivers/{artillery,script}.js had observerPoolPath wired but NOT
ddpMessagePath. Added prepareDdpMessageOutput + passthrough.
- tests/unit/metric-keys-contract.test.js — appended observer_pool
and ddp_messages to ALLOWED_METRIC_KEYS.
- apps/tasks-3.x/packages/bench-monitors/README.md — added rows for
both new monitors to the Current monitors table and the file-layout
tree.
Dashboard panels (apps/dashboard/imports/ui/pages/detail.{html,js}):
- Observer pool — Multiplexers/Handles min/max/avg/end table.
- DDP messages — totals + rates summary + by_type breakdown (merged
from in/out maps, sorted by combined count desc).
Both guarded with hasXxx helpers (CC-5 absence → omitted cards).
E2E validation on ddp-reactive-light (150 VUs, 33s, oplog driver):
- 9 metric keys in result JSON: app_resources, db_resources, gc,
mongo_ops, ddp_methods, ddp_subscriptions, live_update_propagation,
observer_pool, ddp_messages
- DDP messages: 6300 in / 25835 out (179/s · 738/s) — breakdown
{in: method:6150, sub:150} / {out: added:7719, result:6150,
updated:6150, removed:5517, ready:150, connected:149}
- Observer pool: 34 samples — max muxes=1, max handles=1 (deduped
cursor, but VU avg-session ~100ms so most samples catch idle state)
Gates: 278/278 unit tests in ~750ms; bench.js list + compare clean.
Task 06 (bundle-delta) was cherry-picked separately as commit abc888c.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
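The by_type breakdown the dashboard shows (in/out maps merged, sorted by combined count descending) can be sketched with the numbers from the E2E run above. The function name `mergeByType` and the alphabetical tie-break are assumptions.

```javascript
// Merges per-direction type counts into rows sorted by combined count, descending.
function mergeByType(inCounts, outCounts) {
  const types = new Set([...Object.keys(inCounts), ...Object.keys(outCounts)]);
  return [...types]
    .map((type) => {
      const incoming = inCounts[type] ?? 0;
      const outgoing = outCounts[type] ?? 0;
      return { type, in: incoming, out: outgoing, total: incoming + outgoing };
    })
    // Assumption: ties are broken alphabetically so the order is deterministic.
    .sort((a, b) => b.total - a.total || a.type.localeCompare(b.type));
}

const messageRows = mergeByType(
  { method: 6150, sub: 150 },
  { added: 7719, result: 6150, updated: 6150, removed: 5517, ready: 150, connected: 149 },
);
console.log(messageRows.map((r) => `${r.type}:${r.total}`).join(' '));
```

Summing the `in` and `out` columns of these rows gives 6300 and 25835, the totals reported for the run.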
Standalone Mongo collector (mirrors task 04's mongo-ops-monitor) that
profiles slow ops during the benchmark window and surfaces them under
metrics.mongo_slow_queries: total_slow, by_op breakdown, slowest_ms, and a
sanitized slowest_sample (ns/op/filter_keys/millis/planSummary).
Profile-on-demand pattern: on startup the collector connects to the app DB
(default "meteor", overridable via BENCH_MONGO_DB), enables the profiler at
slowms=100, and records a benchmarkStart timestamp; on SIGTERM it reads the
captured slow ops, aggregates, restores the original profiler config, and
emits JSON on stdout. Aggregation lives in a pure, unit-tested module
(runner/slow-query-aggregator.js) and runs inside the collector, so the
result flows through stopCollectors' generic stdout-JSON drain — no new read
block in collectors.js.
Three REVISIONS.md (task 12) fixes vs the original spec:
1. Full profiler-config capture+restore. slowms is a sticky GLOBAL Mongo
setting that {profile:0} does NOT reset, so we capture {was, slowms,
sampleRate} via {profile:-1} and restore all three on shutdown —
otherwise every run leaks the harness's slowms=100 into the developer's
Mongo. Restore happens before the stdout write so the DB is left clean
even if aggregation throws.
2. The query predicate is read from command.filter (not a top-level field).
3. Timestamp-window read of system.profile (ts >= benchmarkStart) instead of
the destructive system.profile.drop().
PII safety: slowest_sample.filter_keys carries only filter KEY NAMES, never
values (the profile doc holds the full predicate, sensitive on prod data).
Absence convention CC-5: empty profile → aggregator returns null → collector
writes nothing → key omitted. Init/error paths exit 0 with no stdout.
Plumbing: MONGO_SLOW_QUERY_MONITOR const + spawnMongoSlowQueryMonitor +
startCollectors push (alongside mongo-ops, gated on mongoUri) in
runner/collectors.js. mongoUri already passes through both drivers. One-line
extension to the metric-keys contract test.
Tests: 15 new aggregator cases (empty/null, mixed op breakdown, slowest by
millis, filter_keys sanitization + no value leak, deterministic tie-break,
threshold/duration passthrough, missing filter, unknown op, large array,
shape contract). Suite 278 -> 293, all green; bench.js list + compare exit 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
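The aggregation and the key-names-only sanitization described in this commit can be sketched as a pure function over profile documents. The function name and the "first document wins on ties" rule are assumptions; the field names (`op`, `ns`, `millis`, `planSummary`, `command.filter`) follow the commit text.

```javascript
// Hypothetical aggregation over documents read from system.profile.
function aggregateSlowQueries(profileDocs) {
  // Absence convention: nothing captured means no metric at all.
  if (!Array.isArray(profileDocs) || profileDocs.length === 0) return null;

  const byOp = {};
  let slowest = profileDocs[0];
  for (const doc of profileDocs) {
    const op = doc.op ?? 'unknown';
    byOp[op] = (byOp[op] ?? 0) + 1;
    // Strict '>' keeps the earliest document on ties (assumed tie-break).
    if ((doc.millis ?? 0) > (slowest.millis ?? 0)) slowest = doc;
  }

  return {
    metric: 'mongo_slow_queries',
    total_slow: profileDocs.length,
    by_op: byOp,
    slowest_ms: slowest.millis ?? 0,
    slowest_sample: {
      ns: slowest.ns,
      op: slowest.op ?? 'unknown',
      // Key names only: predicate values may hold user data and are never copied.
      filter_keys: Object.keys(slowest.command?.filter ?? {}),
      millis: slowest.millis ?? 0,
      planSummary: slowest.planSummary,
    },
  };
}
```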
Standalone out-of-process Mongo collector, same delta-vs-baseline pattern
as task 04's mongo_ops. On startup it discovers the app's user
collections via db.listCollections() (excluding system.* and any
non-alphanumeric-leading internal namespaces) and snapshots each
collection's per-index $indexStats — accesses.ops + since. On SIGTERM it
re-snapshots, diffs per index, and emits metrics.mongo_index_usage to
stdout, which stopCollectors' generic JSON drain already ingests (no
special handling).
Output shape: { metric: 'mongo_index_usage', collections: { <name>:
[{ name, ops_in_window, since, key }] } }. ops_in_window is end.ops −
start.ops (what THIS run hit, not lifetime); since is normalized to an
ISO string. Normalization mirrors mongo_ops: an index created/first-
tracked mid-run (no baseline) counts its full end value; a counter reset
(end < start, server restart) uses end; an index dropped mid-run falls
out (only end rows are iterated). Never-used indexes (ops_in_window 0)
are KEPT so dead indexes that cost write-amplification stay visible.
Absence convention CC-5: init failure, no user collections, or zero
index rows → no stdout, so the key is omitted and other metrics are
unaffected. CC-6: the mongodb driver comes through runner/_io.js
(io.MongoClient) for mockability.
The diff math is extracted into a pure runner/index-usage-aggregator.js
(aggregateIndexUsage) covered by 13 unit tests; total suite 278 → 291.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
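The per-index diff rules (no baseline counts the full end value, a counter reset uses end, dropped indexes fall out, unused indexes are kept) can be sketched for a single collection. The function name `diffIndexUsage` is hypothetical, and the snapshot rows are assumed to carry `name`, `key`, and `accesses: { ops, since }` with `ops` already a plain number.

```javascript
// Hypothetical diff of two $indexStats snapshots for one collection.
function diffIndexUsage(startRows, endRows) {
  const baseline = new Map(startRows.map((row) => [row.name, Number(row.accesses.ops)]));
  // Only end rows are iterated, so an index dropped mid-run falls out.
  return endRows.map((row) => {
    const endOps = Number(row.accesses.ops);
    const startOps = baseline.get(row.name);
    // No baseline (index created mid-run) or a counter reset: count the full end value.
    const opsInWindow =
      startOps === undefined || endOps < startOps ? endOps : endOps - startOps;
    return {
      name: row.name,
      ops_in_window: opsInWindow, // zero-use indexes are kept on purpose
      since: new Date(row.accesses.since).toISOString(),
      key: row.key,
    };
  });
}
```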
Standalone Mongo connection-pool sampler that polls
serverStatus().connections every second over the benchmark window
(a time-series sampler, NOT a start/end delta — connection counts
rise and fall as VUs connect/disconnect) and surfaces as
metrics.mongo_pool. Mirrors the task 04 mongo_ops collector: spawned
out-of-process by startCollectors, emits aggregated JSON on stdout on
SIGTERM, flows through the generic JSON drain.
REVISIONS.md task 14 fixes, re-verified live against the dev_bundle
Mongo 7.0.16 (serverStatus().connections probe):
- connections.totalClosed does NOT exist in Mongo 7.0 — dropped
total_closed from the output shape.
- connections.available is server-side LISTENER HEADROOM (~800k max
incoming slots), NOT idle pool connections — dropped; it's not a
useful saturation signal. Saturation shows as current ≈ active, so
we sample `active` instead and compare current vs active.
Output shape: current + active as time-series (min/max/avg/end),
total_created as start/end/delta (monotonic counter, only the window
delta is meaningful).
Aggregation (min/max/avg/end + counter delta) is a pure module
(runner/connection-pool-aggregator.js) so it's unit-tested without
MongoClient or a child process; returns null when no samples were
captured (CC-5 absence → caller omits the key). 13 new unit tests.
Files: collectors/mongo-pool-monitor.js (new),
runner/connection-pool-aggregator.js (new),
tests/unit/connection-pool-aggregator.test.js (new),
runner/collectors.js (+spawnMongoPoolMonitor + startCollectors push),
tests/unit/metric-keys-contract.test.js (+mongo_pool).
Gates: npm test 291/291 (was 278), bench.js list clean, bench.js
compare exit 0. Smoke-tested live against mongod:4001.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
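The output shape described here (current and active as min/max/avg/end series, total_created as a start/end/delta counter) can be sketched as a pure aggregator. The function names and the per-sample field names (`current`, `active`, `totalCreated`, mirroring serverStatus().connections) are assumptions about how samples are stored, and the `samples` count field is added only for illustration.

```javascript
// Summarizes a non-empty numeric series as min/max/avg/end.
function summarizeSeries(values) {
  const sum = values.reduce((acc, v) => acc + v, 0);
  return {
    min: Math.min(...values),
    max: Math.max(...values),
    avg: sum / values.length,
    end: values[values.length - 1],
  };
}

// samples: [{ current, active, totalCreated }], one entry per polling tick.
function aggregatePool(samples) {
  // Absence convention: no samples means the caller omits the key.
  if (!Array.isArray(samples) || samples.length === 0) return null;
  const first = samples[0];
  const last = samples[samples.length - 1];
  return {
    metric: 'mongo_pool',
    samples: samples.length,
    current: summarizeSeries(samples.map((s) => s.current)),
    active: summarizeSeries(samples.map((s) => s.active)),
    // Monotonic counter: only the delta over the window is meaningful.
    total_created: {
      start: first.totalCreated,
      end: last.totalCreated,
      delta: last.totalCreated - first.totalCreated,
    },
  };
}
```

Comparing `current.max` with `active.max` from this output is one way to read the saturation signal the commit describes (current approaching active).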
…mongo_pool
Renders the 3 new Mongo metrics (tasks 12/13/14) from the result JSON.
Layout:
- Mongo connection pool — current/active min/max/avg/end +
total_created start/end/delta (REVISIONS-corrected shape, no
totalClosed, saturation = current vs active).
- Mongo slow queries — by_op count table + slowest sample card
(ns + op + ms + filter_keys + planSummary). Filter values are
redacted at the collector layer (PII safety per task 12 spec);
only key names are surfaced.
- Mongo index usage — one full-width card per collection (one row
per index) with name, key, ops_in_window, tracked-since timestamp.
Wide layout because $indexStats rows have a key spec that needs
real estate.
Each panel guarded with `hasXxx` helpers; absent metrics → omitted
cards (CC-5). Notably mongo_slow_queries IS often absent — the
default 100ms slowms threshold means a clean run reports nothing,
which is the correct signal.
E2E validation on ddp-reactive-light (150 VUs, 33s, oplog driver):
- 11 metric keys in result JSON: app_resources, db_resources, gc,
mongo_ops, ddp_methods, ddp_subscriptions, live_update_propagation,
observer_pool, ddp_messages, mongo_index_usage, mongo_pool.
- mongo_slow_queries correctly absent (nothing >100ms in this workload).
- mongo_pool: 39 samples — current 18→27 avg 24.4, active 5→10
avg 9.7, total_created delta=9 connections.
- mongo_index_usage: taskCollection._id_ index → 3000 ops in window.
Gates: 319/319 tests in ~735ms (+41 from the 3 new aggregator suites);
bench.js list + compare clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New bench-monitor that measures the BYTE SIZE of every DDP message in each direction, surfaced as metrics.ddp_frame_size: in/out size distributions (count + avg/p50/p95/p99/max) plus a per-type byte-sum breakdown. Complementary to task 07's ddp_messages, which COUNTS messages; this measures how big they are, so two configs with identical message counts but different byte volumes (partial- vs full-doc, sockjs vs uws framing) are distinguishable. Separate metric key, not an extension of ddp-message-counter.

Per-message bytes via Buffer.byteLength(JSON.stringify(msg), 'utf8') inside the Session.prototype.send wrap (outgoing) and the Meteor.onMessage callback (incoming). REVISIONS.md task 08: the hooks see the structured msg object PRE-serialization, so the serialized DDP JSON length is the canonical wire size (pre-compression; task 09 will cover post-compression separately). Same lazy Session-prototype grab as ddp-message-counter (re-implemented, not cross-imported; monitors stay independent).

Field naming (CC-4): byte percentiles carry a _bytes SUFFIX (p50_bytes/p95_bytes/p99_bytes, avg_bytes, max_bytes) rather than the bare p50/p95/p99 form. The bare form exists to match the shipped event_loop_delay contract whose percentiles are in ms; these are bytes, so the unit suffix is required. Percentiles come from the shared lib/percentiles.js summarize (CC-1).

CC-8 high-volume: accumulate in memory + flush via installDumpOnShutdown (raw size arrays + per-type byte sums); the harness-side aggregator (runner/frame-size-aggregator.js) computes percentiles. Per-direction sample arrays cap at 200k (bounds memory on >1M-message runs per the spec's sampling note); per-type byte sums keep accumulating past the cap so byte accounting stays complete.

Absence (CC-5): no messages either direction → aggregator returns null → key omitted. Gated entirely on DDP_FRAME_SIZE_OUTPUT.

Plumbing: re-export from bench-monitors.server.js; call in server/main.js Meteor.startup after initDdpMessageCounter; DDP_FRAME_SIZE_OUTPUT passthrough in runner/meteor-process.js; prepareFrameSizeOutput + read-aggregate-unlink block in runner/collectors.js; wired through both drivers. One-line contract test extension; README monitors table + file-tree updated.

Tests: 13 new aggregator cases (null/empty→null, single value, known [1..100] nearest-rank percentiles, both directions, float avg rounding, by_type passthrough + copy-not-reference, _bytes-suffix shape contract, large array, in-only/out-only). Suite 319 → 332, all green; bench.js list + compare exit 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
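As a rough illustration of the frame-size measurement and summary described in this commit, the sketch below measures serialized size, caps the sample array while the per-type sums keep growing, and computes nearest-rank percentiles. The helper names, the "keep the first N samples" capping strategy, and the two-decimal `avg_bytes` rounding are assumptions; the real code lives in the bench-monitors package, runner/frame-size-aggregator.js and lib/percentiles.js.

```javascript
// Hypothetical sketch; names and the capping strategy are illustrative.
const SAMPLE_CAP = 200000;

// Serialized DDP JSON length in bytes (pre-compression wire size).
function frameBytes(msg) {
  return Buffer.byteLength(JSON.stringify(msg), 'utf8');
}

// Nearest-rank percentile over an ascending-sorted array.
function nearestRank(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize raw sizes with unit-suffixed field names; null when empty.
function summarizeBytes(sizes) {
  if (!sizes || sizes.length === 0) return null;
  const sorted = [...sizes].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, n) => acc + n, 0);
  return {
    count: sorted.length,
    avg_bytes: Math.round((sum / sorted.length) * 100) / 100,
    p50_bytes: nearestRank(sorted, 50),
    p95_bytes: nearestRank(sorted, 95),
    p99_bytes: nearestRank(sorted, 99),
    max_bytes: sorted[sorted.length - 1],
  };
}

// In-memory recorder: samples stop at the cap, per-type sums do not.
function createRecorder(cap = SAMPLE_CAP) {
  const sizes = [];
  const byType = {};
  return {
    record(msg) {
      const n = frameBytes(msg);
      if (sizes.length < cap) sizes.push(n);
      const type = (msg && msg.msg) || 'unknown'; // DDP message type field
      byType[type] = (byType[type] || 0) + n;
    },
    dump() {
      return { sizes, by_type_bytes: byType };
    },
  };
}
```

Keeping the byte sums outside the cap means total byte accounting stays exact even when the percentile inputs are truncated.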
Standalone Mongo collector that polls currentOp every 250ms for
in-flight change-stream getMore cursors and surfaces as
metrics.mongo_changestream — a time-series cursor_count (min/max/avg/end)
plus a per-namespace breakdown (max + avg per ns). Mirrors the task 14
mongo_pool collector: spawned out-of-process by startCollectors when
mongoUri is set, emits aggregated JSON on stdout on SIGTERM, flows
through the generic JSON drain. Uses the CC-6 mongodb npm driver via
io.MongoClient. Aggregation is a pure module (changestream-aggregator.js)
so it's unit-tested without MongoClient — 13 new tests.
currentOp filter VERIFIED LIVE against the dev_bundle Mongo 7.0.16
(single-node RS) by opening real change streams and watching currentOp.
The REVISIONS.md task 24 filter does NOT work as written — two
corrections, both confirmed live:
1. Field path is cursor.originatingCommand.pipeline, WITH the `cursor.`
prefix. originatingCommand sits under the `cursor` sub-doc of each
inprog entry, not top level. (REVISIONS prose says
"cursor.originatingCommand.pipeline" but its code block dropped the
prefix; the top-level path matches 0, the cursor-prefixed path
matches every change stream.)
2. $elemMatch: { $changeStream: { $exists: true } } THROWS "unknown
operator: $changeStream" (Mongo parses $changeStream as a query
operator inside $elemMatch). The dotted-path form
{ 'cursor.originatingCommand.pipeline.$changeStream': { $exists:
true } } does not throw and matches correctly.
Per REVISIONS: 250ms sampling (1 Hz undercounts — getMores complete
sub-second), NO idleCursors flag (change-stream getMores are active
awaitData ops; idleCursors:true returns 0). Per-namespace keys off
op.ns (stable real id, CC-7). Absence (CC-5): exits 0 with no stdout on
init error → key omitted; under the oplog driver every sample is just 0.
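A minimal sketch of the corrected filter and the per-namespace count for one sample. The dotted-path filter is copied from the commit text above; passing it inline to the `currentOp` admin command is an assumption about how the collector issues the query, and the function names are hypothetical.

```javascript
// Dotted-path form taken from the commit message; it avoids the
// "unknown operator: $changeStream" error that $elemMatch triggers.
const CHANGE_STREAM_FILTER = {
  'cursor.originatingCommand.pipeline.$changeStream': { $exists: true },
};

// Assumption: the filter is passed inline to the currentOp admin command,
// e.g. db.admin().command(buildCurrentOpCommand()). No idleCursors flag.
function buildCurrentOpCommand() {
  return { currentOp: 1, ...CHANGE_STREAM_FILTER };
}

// Count matching in-flight ops for one sample, keyed by op.ns.
function countByNamespace(inprog) {
  const byNamespace = {};
  for (const op of inprog || []) {
    const ns = op.ns || 'unknown';
    byNamespace[ns] = (byNamespace[ns] || 0) + 1;
  }
  const total = Object.values(byNamespace).reduce((a, b) => a + b, 0);
  return { total, by_namespace: byNamespace };
}
```

Sampling this every 250ms and feeding the totals into a min/max/avg/end summary would give the cursor_count series described above.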
Smoke-tested live: ran the collector against mongod:4001 while holding
3 change streams open (2 on meteor.tasks, 1 on meteor.widgets) → 10
samples, cursor_count max=3, by_namespace {tasks:{max:2}, widgets:
{max:1}}, exit 0.
Files: collectors/mongo-changestream-monitor.js (new),
runner/changestream-aggregator.js (new),
tests/unit/changestream-aggregator.test.js (new),
runner/collectors.js (+spawnMongoChangestreamMonitor + push),
tests/unit/metric-keys-contract.test.js (+mongo_changestream).
Gates: npm test 332/332 (was 319), bench.js list clean, bench.js
compare exit 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Standalone out-of-process Mongo collector, same start+end delta pattern
as task 04's mongo_ops. On startup it snapshots
serverStatus().wiredTiger.cache; on SIGTERM it re-snapshots, computes the
per-window page-count deltas and the cache hit ratio, captures the
end-of-run bytes-in-cache gauge, and emits metrics.mongo_wiredtiger to
stdout — which stopCollectors' generic JSON drain already ingests (no
special handling).
cache_hit_ratio = (pages_requested − pages_read_in) / pages_requested,
computed on the DELTAS over the benchmark window (what THIS run hit), not
lifetime counters. The four WiredTiger fields are read by their exact
human-readable string names ("pages requested from the cache", "pages
read into cache", "pages written from the cache", "bytes currently in
the cache") — validated against live Mongo 7.0 per REVISIONS, no renames.
serverStatus is server-wide so the admin DB handle is used and the ratio
reflects the whole mongod's global cache (fine for a Meteor-dominated
workload).
Normalization mirrors mongo_ops: a counter whose end < start (server
restart mid-run) uses the end value; read_in is clamped so the ratio
stays in [0, 1]. Absence convention CC-5: init failure, a non-WiredTiger
storage engine (wiredTiger sub-doc absent), or zero cache traffic
(requested delta 0, which also covers two identical snapshots) → no
stdout, so the key is omitted and other metrics are unaffected. CC-6: the
mongodb driver comes through runner/_io.js (io.MongoClient) for
mockability.
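The delta-based hit ratio and its normalization rules can be sketched as below. The four WiredTiger field names are the ones quoted above; the function names and the output field names (`pages_requested`, `pages_read_in`, `pages_written`, `bytes_in_cache`) are assumptions, not necessarily what aggregateWiredTiger emits.

```javascript
// Exact serverStatus().wiredTiger.cache field names quoted in the commit.
const WT = {
  requested: 'pages requested from the cache',
  readIn: 'pages read into cache',
  written: 'pages written from the cache',
  bytes: 'bytes currently in the cache',
};

// Window delta; if the counter went backwards (server restart mid-run),
// fall back to the end value.
function windowDelta(start, end) {
  return end < start ? end : end - start;
}

// Hypothetical sketch: returns null when there is nothing to report.
function aggregateCache(startCache, endCache) {
  if (!startCache || !endCache) return null; // init failure or non-WiredTiger engine
  const requested = windowDelta(startCache[WT.requested], endCache[WT.requested]);
  if (!(requested > 0)) return null; // zero cache traffic in the window
  const readInRaw = windowDelta(startCache[WT.readIn], endCache[WT.readIn]);
  // Clamp so the ratio stays within [0, 1].
  const readIn = Math.min(Math.max(readInRaw, 0), requested);
  return {
    metric: 'mongo_wiredtiger',
    pages_requested: requested,
    pages_read_in: readIn,
    pages_written: windowDelta(startCache[WT.written], endCache[WT.written]),
    bytes_in_cache: endCache[WT.bytes], // end-of-run gauge, not a delta
    cache_hit_ratio: (requested - readIn) / requested,
  };
}
```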
The hit-ratio math is extracted into a pure
runner/wiredtiger-aggregator.js (aggregateWiredTiger) covered by 12 unit
tests; total suite 319 → 331.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Coarse DDP compression ratio per the task 09 spec (the precise per-msg
variant needs meteor-source changes that are out of scope; this ships
the workable approximation). The metric pairs:
- pre-compression bytes: JSON.stringify lengths from the frame-size
monitor's existing dump (task 08).
- post-compression bytes: socket.bytesRead/bytesWritten deltas
across all tracked connections, captured by a new in-app monitor.
The harness aggregator (runner/compression-aggregator.js) consumes BOTH
dumps in stopCollectors — frame-size is read first and its raw parsed
dump is kept in scope for the compression read that follows. If either
dump is absent the metric is omitted (CC-5).
Output shape (`metrics.ddp_compression`):
{
out: { uncompressed_bytes, compressed_bytes, ratio, savings_pct },
in: { uncompressed_bytes, compressed_bytes, ratio, savings_pct },
}
Ratios > 1 (WS framing overhead inflating tiny-msg traffic) pass
through honestly — not clamped. Per-direction ratio is null when
uncompressed bytes in that direction are 0 (divide-by-zero guard).
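A minimal sketch of the per-direction math implied by the shape above. Defining ratio as compressed / uncompressed is inferred from the "ratios > 1" remark, and `savings_pct = (1 - ratio) * 100` is an assumption; the dump shapes (`in_bytes` / `out_bytes`) and function names are hypothetical.

```javascript
// Round to 4 decimals, as the test list for this commit mentions.
function round4(n) {
  return Math.round(n * 10000) / 10000;
}

// One direction: ratio is null when there are no uncompressed bytes.
function directionStats(uncompressed, compressed) {
  const u = Number(uncompressed) || 0; // non-numeric input coerces to 0
  const c = Number(compressed) || 0;
  const ratio = u > 0 ? round4(c / u) : null; // may exceed 1; not clamped
  return {
    uncompressed_bytes: u,
    compressed_bytes: c,
    ratio,
    savings_pct: ratio === null ? null : round4((1 - ratio) * 100),
  };
}

// Hypothetical dump shapes: { in_bytes, out_bytes } for each side.
function aggregateCompression(frameDump, socketDump) {
  if (!frameDump || !socketDump) return null; // both dumps are required
  const out = directionStats(frameDump.out_bytes, socketDump.out_bytes);
  const inn = directionStats(frameDump.in_bytes, socketDump.in_bytes);
  const allZero =
    out.uncompressed_bytes + out.compressed_bytes +
    inn.uncompressed_bytes + inn.compressed_bytes === 0;
  return allZero ? null : { out, in: inn };
}
```

A ratio above 1 yields a negative savings_pct, which reports the framing overhead rather than hiding it.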
Implementation notes:
- New compression-tracker.server.js: Meteor.onConnection registers
per-conn baseline socket bytes (resolved via 6 candidate paths
through the Meteor internals; falls back gracefully + warns once
if none match the running Meteor). conn.onClose captures final
bytes; the SIGTERM dump also sums live-connection deltas so
counts stay self-consistent across the periodic snapshot writes.
- All standard plumbing: env-gated init, installDumpOnShutdown,
bench-monitors.server.js re-export, server/main.js init call,
DDP_COMPRESSION_OUTPUT env passthrough, prepareCompressionOutput
in collectors.js, driver wiring, README row + tree, contract test
+1 line.
Tests: tests/unit/compression-aggregator.test.js (10 cases — both
dumps required, all-zero null, typical ratio + savings, divide-by-zero,
ratio>1 honest passthrough, 4-decimal rounding, non-numeric coercion,
shape lock).
Gates: 368/368 (+11 from compression aggregator suite); bench.js list/
compare clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-cursor observer driver fallback tracker. The existing startup probe
in apps/tasks-3.x/server/main.js measures the driver Meteor picks for
ONE throwaway cursor. This metric measures the driver Meteor actually
selects for EVERY cursor opened during the run — so a benchmark labeled
`oplog` doesn't quietly run 30% polling due to per-cursor fall-through.
Surfaces as `metrics.driver_fallbacks`:
{
metric: 'driver_fallbacks',
total_cursors, no_fallback, configured_first,
fallbacks: { 'changeStreams_to_oplog': N, 'oplog_to_polling': M, ... }
}
REVISIONS.md task 10: do NOT wrap `_selectReactivityDriver` — the
polling fallback path bypasses it (happens at mongo_connection.js:1188).
Instead wrap the connection-instance `_observeChanges` and read the
selected driver off `handle._multiplexer._observeDriver.constructor.name`
(same internal the startup probe uses).
Instance-level wrap (not prototype) is safer here — Meteor normally has
one default mongo connection; wrapping the instance avoids any chance of
double-wrap from other code touching the prototype.
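The instance-level wrap could look roughly like the sketch below. It is illustrative only: `wrapObserveChanges` and the `counts` object are hypothetical, and whether `_observeChanges` returns the handle directly or a promise depends on the Meteor version, so the sketch handles both. Only the `_multiplexer._observeDriver.constructor.name` path comes from the commit text.

```javascript
// Hypothetical sketch: wrap ONE connection instance, not the prototype.
function wrapObserveChanges(connection, counts) {
  const original = connection._observeChanges;

  connection._observeChanges = function wrappedObserveChanges(...args) {
    const result = original.apply(this, args);

    // Read the driver Meteor actually selected for this cursor.
    const record = (handle) => {
      const driver =
        handle && handle._multiplexer && handle._multiplexer._observeDriver;
      const name = driver ? driver.constructor.name : 'unknown';
      counts[name] = (counts[name] || 0) + 1;
      return handle;
    };

    // Support both sync and promise-returning implementations (assumption).
    return result && typeof result.then === 'function'
      ? result.then(record)
      : record(result);
  };

  // Return an unwrap function so the original can be restored.
  return function restore() {
    connection._observeChanges = original;
  };
}
```

Turning the per-driver counts into transition keys such as `oplog_to_polling` would additionally need the configured driver order, which this sketch does not model.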
Standard plumbing (env-gated init, installDumpOnShutdown, bench-monitors
re-export, server/main.js init, DRIVER_FALLBACK_OUTPUT env passthrough,
prepare + spawn wiring in collectors.js, driver wiring, README row,
contract test +1 line).
Tests: tests/unit/driver-fallback-aggregator.test.js (8 cases — null on
total=0, pass-through shape, multiple transitions preserved, missing
configured_first → null, non-numeric coercion, defensive copy of
fallbacks object, key lock).
Gates: 376/376 (+8 from driver-fallback aggregator suite); bench.js
list/compare clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Renders the 5 new metrics added this round in detail.html + detail.js:
- DDP frame size — per-direction count + avg/p50/p95/p99/max bytes
- DDP compression (coarse) — per-direction uncompressed/compressed
bytes + ratio + savings_pct
- Mongo change-stream cursors — time-series total + per-namespace
- Mongo WiredTiger cache — hit ratio + 4 page counters + bytes_in_cache
- Driver fallbacks — total observe()s + no_fallback + per-transition
counts (e.g. changeStreams_to_oplog: 150)
Layout: 3 new rows of 2 cards each, plus driver_fallbacks alone (it's
the standout signal — putting it on its own row makes the per-transition
breakdown legible). Each panel guarded with hasXxx; absent metrics →
omitted cards (CC-5).
ALSO ships a small aggregator fix for compression: when uncompressed > 0
but compressed = 0, emit null ratio/savings_pct rather than the
nonsense "100% savings" value. That zero-compressed case is the signature
of a failed socket-byte capture (compression-tracker.server.js's
findRawSocket couldn't resolve the underlying TCP socket on this
Meteor/transport combination). Documented honestly via null instead of
a misleading number; +1 test case for the detection.
E2E validation on ddp-reactive-light (150 VUs, 33s, oplog driver):
- 16 metric keys in result JSON (was 11 before this round)
- ddp_frame_size: in 132B avg / out 108B avg; by_type_bytes shows
`added` dominates outgoing (~2 MB of ~3 MB total)
- ddp_compression: emitted with null ratio (socket capture
unresolved — known limitation, not silently 100% anymore)
- mongo_changestream: 157 samples; max=0 (oplog driver in use, no
change-stream cursors live — correct)
- mongo_wiredtiger: 99.86% hit ratio · 91k page requests · 132 reads
into cache · 28 MB in cache (healthy)
- driver_fallbacks: 150/150 cursors fell back changeStreams→oplog
(matches runtime-info probe; configured first never used)
Gates: 377/377 (+1 from new compression test); bench.js list/compare
clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New build-profile driver runs ONE production build with METEOR_PROFILE=1, parses the profile tree once, and emits TWO metrics from it: build_profile (top-N hottest build nodes by self_ms + long-tail roll-up) and plugin_compile (per-compiler-plugin time). One build serves both, avoiding a second 60-180s build per benchmark.

Spec redesigned around METEOR_PROFILE=1 because `meteor build --verbose` emits zero timing. Exploration of a REAL build (tasks-3.x) corrected several brief assumptions, documented in the parser header and captured as a regression fixture (tests/unit/fixtures/meteor-profile-sample.txt):
- Profile goes to STDOUT, not stderr.
- Each line is `| <indent><name><pad><N> ms[ (<count>)]`. Depth is encoded by 3-column box-drawing groups (│ / ├─ / └─ / spaces); dot-leader vs space padding is cosmetic (has-children), not depth. ms/count carry thousands commas; count is optional on synthetic "other X" lines.
- Plugin nodes are top-level entries literally named `plugin <name>` (plugin ecmascript / typescript / static-html / meteor verified live).
- total_ms is read from the authoritative `(meteor#1) Total: N ms` line, NOT a tree sum (tree timings nest/overlap → summing double-counts).
- The trailing "Top leaves:" block duplicates tree entries and is skipped so it can't double-rank the top-N.

Files:
- runner/meteor-profile-parser.js (pure, defensive: unmatched lines skipped, truncated output parses partially, never throws)
- runner/build-profile-aggregator.js (rank by self_ms, top-N default 5, children_ms = descendant self_ms sum for top nodes, long_tail_ms = total - top_n_total clamped at 0, null on empty)
- runner/plugin-compile-aggregator.js (filter `plugin ` prefix, group by stable name, sum on recurrence, null when none)
- drivers/build-profile.js captures stdout (32MB maxBuffer), resets first for a clean compile, parses partial output on nonzero exit, pushes only non-null aggregates (CC-5).

Wired via drivers/index.js, cli/run.js pickDriver (driver: 'build-profile'), bench.config.js scenario. Two contract-test keys under a new Phase D header.

Output shapes:
  build_profile = { metric, total_ms, top_nodes:[{name,self_ms,children_ms,count}], top_n_count, top_n_total_ms, long_tail_ms }
  plugin_compile = { metric, total_plugin_ms, plugins:{<name>:{self_ms,count}} }

Tests: 36 new (15 parser incl. real-fixture assertion, 12 build-profile-agg, 9 plugin-compile-agg). Suite 377 -> 413, all green; bench.js list shows the build-profile scenario; bench.js compare exit 0. E2E left to the consolidated run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
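The two aggregators can be sketched over an already-parsed node list as below. The output keys follow the shapes listed in this commit; the input shape (a flat array of `{ name, self_ms, children_ms, count }` with `children_ms` precomputed) and the function names are assumptions made for illustration.

```javascript
// Hypothetical sketch: rank parsed profile nodes and roll up the long tail.
function aggregateBuildProfile(nodes, totalMs, topN = 5) {
  if (!Array.isArray(nodes) || nodes.length === 0) return null;
  const ranked = [...nodes].sort((a, b) => b.self_ms - a.self_ms).slice(0, topN);
  const topTotal = ranked.reduce((sum, n) => sum + n.self_ms, 0);
  return {
    metric: 'build_profile',
    total_ms: totalMs, // read from the profile's Total line, not a tree sum
    top_nodes: ranked.map((n) => ({
      name: n.name,
      self_ms: n.self_ms,
      children_ms: n.children_ms ?? 0,
      count: n.count ?? null,
    })),
    top_n_count: ranked.length,
    top_n_total_ms: topTotal,
    long_tail_ms: Math.max(totalMs - topTotal, 0), // clamped at 0
  };
}

// Hypothetical sketch: group `plugin <name>` nodes, summing on recurrence.
function aggregatePluginCompile(nodes) {
  const plugins = {};
  let total = 0;
  for (const n of nodes || []) {
    if (!n.name.startsWith('plugin ')) continue;
    const key = n.name.slice('plugin '.length);
    const entry = plugins[key] || (plugins[key] = { self_ms: 0, count: 0 });
    entry.self_ms += n.self_ms;
    entry.count += n.count || 0;
    total += n.self_ms;
  }
  if (Object.keys(plugins).length === 0) return null;
  return { metric: 'plugin_compile', total_plugin_ms: total, plugins };
}
```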
Renders the 2 new build-time metrics from the Phase D round in
detail.html + detail.js. Both are full-width single-column cards
(below the DDP/Mongo panels) since the table rows can be wide:
- Build profile (top hot nodes) — table of {name, self_ms,
children_ms, count} for the top-N hot nodes from the
METEOR_PROFILE=1 tree. Footer shows total / top-N total / long
tail breakdown.
- Per-compiler-plugin time — table of {plugin, self_ms, count}
sorted by self_ms desc. Footer shows total + plugin count.
Each panel guarded with hasXxx; absent metrics → omitted cards (CC-5).
Build-time metrics only appear on `build-profile` scenario runs (not
DDP scenarios), so the cards naturally hide on the runs that don't
emit them.
ALSO marks task 22 (hot-reload) as deferred in
.claude/metrics-tasks/README.md — the first parallel attempt stalled
mid-implementation (4 REVISIONS fixes: Playwright install + chromium,
hmr-probe module, console marker listener, SIGINT cleanup handler).
The agent worktree + branch were removed cleanly; no partial work is
preserved in git history (no commit landed). Task can be
re-attempted with fresh context later.
E2E validation on ddp-reactive-light's sibling `build-profile`
scenario (19.7s wall, of which 9.7s was tracked in the profile):
- top hot node: Babel.compile 3543 ms (234 invocations)
- long tail: 2128 ms (22% of tracked time)
- 4 plugins detected (ecmascript, typescript, static-html, meteor)
with tiny times — this app's source is small + isopacks dominate
so plugin-compile workload is light. Metric infra works; bigger
apps would stress it more.
Gates: 413/413 tests in ~845ms; bench.js list/compare clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…heme toggle
Phase A of v2 redesign. Tailwind 4 is wired in via the @tailwindcss/cli watcher writing to client/main.css (Meteor's standard-minifier-css picks that up). PostCSS through the Meteor pipeline turned out flaky with rspack, so the CLI approach is what actually applies utilities.
- _tw/main.tailwind.css is the source: imports tailwindcss, declares the class-based dark variant, maps Geist/JetBrains Mono into @theme, and scans both client/ and imports/ for utilities.
- client/main.html drops the Bootstrap CDN, loads Geist + JetBrains Mono from Google Fonts, applies sans + dark canvas tokens on <body>.
- client/main.js applies the saved theme class on Meteor.startup so there's no FOUC.
- layouts/main.html is the new sidebar: brand block (Meteor Benchmark + v2.4.1-stable), Runs/Compare/Trends primary nav with an indigo active left-border, dimmed Settings/Documentation, ☀/☾ theme button + user pinned to the bottom. Mobile gets a thin top bar instead.
- layouts/main.js owns the toggle: click → swap <html> class + localStorage('meteor-bench-theme') + a reactive var so the icon updates instantly.
Pages still use Bootstrap markup → unstyled until Phase B–E rewrite them. Sidebar + dark canvas verified at http://localhost:4000/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase B. The Runs landing is just one dense table now — no auto-compare
panel, no Versions strip; those were v1 over-design. Stitch design
implemented closely.
Markup:
- "Runs" heading + "{n} runs · {n} scenarios" counter at right.
- Filter row: scenario <select> + free-text tag search + a clear-✕
button that only renders when a filter is active.
- Table: When/Version/Tag/Scenario/Wall/CPU/RAM/GC pause/Δ vs prev,
+ an ↗ column linking to detail. Tabular-nums via the `font-tabular`
helper. Row hover tints neutral-50/900.
- Empty state inlines the push command. Loading state too.
- "Load more" widens the publication limit by 30 per click.
JS:
- `whenAgo` for the When column (s/m/h/d, then date).
- `versionLabel` falls back to runtime.channel → "local" because most
local result JSONs report `meteor.version: "system"`.
- Per-row Δ is computed by walking each scenario chronologically and
comparing the current run's wall_clock_ms to its predecessor in the
same scenario. <5% = neutral grey, regression = orange-500,
improvement = green-500. Threshold bands match the v1 compare logic
so Runs ↔ Compare colors stay consistent.
- statusBadge is gone — it was a fake metric (raw wall time → green
badge regardless of scenario), which is misleading. Δ vs prev is the
honest signal.
Foundation tweaks needed along the way:
- Body/<html> background + text color now live in _tw/main.tailwind.css
with an html.dark override, because Meteor strips classes on <body>.
- Theme toggle moved from #themeToggle (id) to .js-theme-toggle (class)
so the mobile and desktop buttons can both wire up to one click
handler without DOM-id collisions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase C. Compare is now a regression scoreboard sorted by absolute Δ%
descending — biggest movers first, "within noise" last.
Markup:
- Top filter row: Scenario · Run A · ⇄ swap · Run B. Run selects show
"{version} · {tag} · {scenario} · {when}" per option so you can
identify a run at a glance. Picking a scenario narrows both A/B
option lists and clears prior picks.
- Headline strip: "{A} → {B}" mono label + summary counts colored per
tier (regressions orange, improvements green, neutral grey) + a
"hide within-noise" checkbox.
- Scoreboard table: Metric · A · B · Δ abs · Δ % · Status pill ·
↗A ↗B per row. Click a row to inline-expand a drilldown sub-table
(per-method for ddp_methods, per-op for mongo_ops, etc.) — wired via
ReactiveDict keyed on metric path.
- Footer line lists metrics that only exist in one of the two runs
("only in A" / "only in B").
JS:
- Single `M` array of metric extractors so adding a new comparable
metric is one entry. drilldown() is optional and feeds the expand
panel.
- classify(delta, unit): hardcoded bands <5% neutral / <25% warn /
≥25% big. Sign-direction is unit-aware — for ms/mb/pct/bytes
(higher = worse) positive Δ becomes regression; for count metrics
it's "info" (no value judgement, just movement).
- pctDelta + fmt handle nulls and unit-specific formatting.
- Subscribes to runs.recent(200) and filters by scenario client-side
from the publication. No new pub needed.
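The banding and direction rules from the JS notes above can be sketched like this. The thresholds (5% / 25%) and the higher-is-worse unit list come from the commit text; the return shape (`band` plus `status`) and the null handling are assumptions for illustration.

```javascript
// Units where a higher value is worse (from the commit text).
const HIGHER_IS_WORSE = new Set(['ms', 'mb', 'pct', 'bytes']);

// Percent change from run A to run B; null when it cannot be computed.
function pctDelta(a, b) {
  if (!Number.isFinite(a) || !Number.isFinite(b) || a === 0) return null;
  return ((b - a) * 100) / Math.abs(a);
}

// Hypothetical return shape: magnitude band plus a direction-aware status.
function classify(delta, unit) {
  if (delta === null || delta === undefined) {
    return { band: 'unknown', status: 'n/a' };
  }
  const abs = Math.abs(delta);
  const band = abs < 5 ? 'neutral' : abs < 25 ? 'warn' : 'big';
  if (band === 'neutral') return { band, status: 'neutral' };
  // Count-like units carry no value judgement, only movement.
  if (!HIGHER_IS_WORSE.has(unit)) return { band, status: 'info' };
  return { band, status: delta > 0 ? 'regression' : 'improvement' };
}
```

For example, `classify(pctDelta(100, 118), 'ms')` lands in the warn band as a regression, while the same movement on a count metric is reported as info.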
Bootstrap is now fully gone from this page — every class is Tailwind
or one of our two helpers (font-tabular, font-mono via the Tailwind
mono var). No Bootstrap collapse component either; the expand is just
a {{#if expanded}} guard and a ReactiveDict toggle.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase D. Single run drill-down redesigned per the Stitch reference:
breadcrumb / header band / sticky left nav-pill rail / grouped sections.
Markup:
- Breadcrumb: Runs / {scenario} / {tag-as-mono}.
- Header band: big mono {tag} as the h1, then a small inline row with
the version pill + (optional) sha + scenario link + date + wall
clock. Right side: Compare ▾ (indigo primary) + Prev run (ghost).
- Verdict line under the header: "▲ +18% wall vs prev (release/3.4)"
colored per band. Computed against the previous run of the same
scenario.
- Two-column body: sticky left rail with Overview/DDP/Mongo/Observer/
Build/Not in run anchor pills (each section is conditional and the
pill is only emitted when the section has anything to show), and the
right column with each section as an h2 + 2-up grid of metric cards.
- "Not in this run" footer lists every absent metric family as muted
mono pills, so analysts can tell apart "we know this was absent" vs
"we just forgot to ship it".
JS:
- Kept every existing `hasXxx` / formatting helper logic from the prior
detail.js — same `this.metrics.<key>` paths, same null-guarding via
absence convention. Just adapted the return shape to feed a single
`metricCard` partial (label/value pairs OR a tableHeaders + tableRows
cells array). This keeps the markup tiny and consistent across the
~14 metric families.
- Section flags (hasDdpSection / hasMongoSection / hasObserverSection
/ hasBuildSection / hasMissingSection) drive the rail + section
visibility from one anyOf() check per group.
- verdict() / prevRunId() walk the same-scenario predecessor for the
Δ banner and the ← Prev run button.
The metricCard partial is the only piece of shared infrastructure on
the page — one component covers both "key/value table" and "real
multi-column table" + optional header badge + footer note.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase E.
Trends:
- One row of filters: Scenario · Metric (with <optgroup> by family —
Runtime / DDP / Mongo / Build) · Segment by [version | tag] toggle
· Date range (7/30/90/all) · right-aligned "n runs · m versions"
counter that reflects the live filter.
- One full-width Chart.js line chart in a card. Brand palette
(indigo / green / orange / teal / pink), cycled per segment key so
colors are stable across re-renders.
- Vertical dotted lines mark each version first-appearance, with a
small mono label at the top edge ("local →"). Implemented via an
inline Chart.js plugin reading versionBoundaries.
- Tabular numerals on both axes via JetBrains Mono. Grid + tick
colors react to the dark class on <html>.
- Point click → /run/:id of that run.
- Custom legend rendered below the chart from the same chartStats
ReactiveVar that powers the counter, so legend and counts can never
drift.
- canvas lives outside {{#if hasData}} (toggled via a `hidden` class)
because Blaze removes the canvas when the if flips false, and the
autorun otherwise can't find the canvas at the moment data arrives.
Scenario:
- Breadcrumb (Runs / {scenario}), big mono scenario title with run
count + driver as a one-line caption, and primary Compare runs +
ghost View trends buttons at the right.
- Two cards in a 50/50 grid: About this scenario (prose) and At a
glance (Driver badge / Virtual users / Duration / Requires browser).
- Technical details collapsed accordion (▸/▾ chevron); body holds the
long-form technical description with inline mono code spans. State
via a ReactiveVar so we don't pull in Bootstrap collapse.
- "Runs for this scenario" — same dense table shape as Runs landing
but filtered by scenario, no Δ column (the scenario itself is the
constant, so prev-run deltas live on Compare/Detail).
- The static SCENARIOS dict was inlined from the prior file. Added a
build-profile entry for completeness.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Detail
Local result JSONs ship meteor.version = "system" and meteor.sha = "unknown" when no Meteor source was wired in. Showing those as literal table values is noise. Now we just omit the row when the value is one of those sentinels, matching how the existing versionLabel() helper handles them in the header pill.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User-facing labels only: JSON contract (wall_clock_ms), dropdown option values, and internal metric keys are unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mongo accepts dots in object keys server-side, but minimongo rejects them on the client. A single dotted key (e.g. metrics.mongo_changestream indexed by `<db>.<collection>`) breaks the whole runs publication, so affected metrics silently vanish from the dashboard. Recursively replace '.' with '_' in keys at insert time.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
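A minimal sketch of the recursive key rewrite. The function name `sanitizeKeys` appears later in this PR; the body below is an illustration rather than the code in runs.js, and the `Date` guard reflects the follow-up fix noted further down (recursing into a Date rebuilt it as `{}`).

```javascript
// Illustrative sketch: replace '.' with '_' in every object key.
function sanitizeKeys(value) {
  if (Array.isArray(value)) return value.map(sanitizeKeys);
  // Leave primitives, null and Date instances untouched.
  if (value === null || typeof value !== 'object' || value instanceof Date) {
    return value;
  }
  const out = {};
  for (const [key, child] of Object.entries(value)) {
    out[key.replace(/\./g, '_')] = sanitizeKeys(child);
  }
  return out;
}
```

Applied at insert time, a key such as `meteor.tasks` is stored as `meteor_tasks`, which minimongo accepts.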
Metric cells render via {{{value}}} (raw HTML) so the <code>/<span>
wrappers take effect. Run data (mongo namespaces, index/plugin/node
names, tags, sha) is machine-generated but untrusted (pushed over DDP
with the bench API key), so HTML-escape every interpolated value to
prevent stored XSS. Also intercept sidebar `#anchor` clicks — FlowRouter
swallows them — and scroll to the section manually.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
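The escaping described in the commit above can be sketched as follows. The helper names (`escapeHtml`, `codeCell`) are hypothetical; the point is that every interpolated value is escaped before it is wrapped in markup destined for a raw `{{{value}}}` cell.

```javascript
// Characters that must be escaped before interpolation into HTML.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Escape an untrusted value for safe use inside HTML text or attributes.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Hypothetical cell builder: only the wrapper is trusted markup.
function codeCell(value) {
  return `<code>${escapeHtml(value)}</code>`;
}
```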
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
via shell). Concat instead of overwrite.
…ear + scenario meta
- runs.js: stop sanitizeKeys from recursing into Date (it rebuilt the
timestamp as {} → "Invalid Date"); add authenticated runs.clear method
- dashboard.html/js: remove the "Δ vs prev" column + its dead delta calc
- scenario.js: add ddp-reactive-extended metadata (fixes "Unknown scenario")
- main.html: version label v2.4.1-stable → v0.1.0-beta
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- bench.js/cli: new `bench.js clear --confirm` dashboard command (mirrors push/baseline over DDP) to wipe all runs via the runs.clear method
- bench.config.js: register ddp-reactive-5min and ddp-reactive-extended
- artillery/ddp-reactive-extended.yml: ~7-min sustained DDP load profile (2→5→10 VU/s) sized for a capable machine, for the observer-driver × transport benchmark matrix
- artillery/ddp-reactive-5min.yml: track the existing 5-min profile
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Refactor: from ad-hoc benchmark scripts to a Meteor Benchmark Platform
live demo: https://meteor-benchmarks.us.galaxycloud.app/
Summary
This PR rewrites the repository from a loose collection of benchmark scripts, Playwright tests and hand-captured `.log` files into a structured, reproducible benchmark platform for Meteor.

What used to be "run a script, eyeball a log, commit the output" is now a single CLI (`bench.js`) with a modular pipeline (drivers → collectors → aggregators → reporters), backed by a Meteor instrumentation package, a results dashboard, a unit-test suite, and CI workflows that run benchmarks on PRs, nightly, and across transport/observer matrices.

The net line count drops (−38k) because ~13k lines of stale benchmark logs and the entire Meteor 2.x app were removed, while the new harness, monitors, dashboard and tests were added.
Why
The old setup (`main`) had real limitations:

- `benchmarks/**/*.log`. No schema, no comparison tooling, no regression gate.

The goal of this branch was to make benchmarking a first-class, automatable workflow: one command to run a scenario, structured JSON out, automatic regression detection against a baseline, and a dashboard to visualize trends across Meteor versions and transport/observer configurations.
What changed
1. New CLI & harness architecture
A thin `bench.js` entry point dispatches to focused modules:

Subcommands:

- `node bench.js list`
- `node bench.js run --scenario X --app Y --tag Z`
- `node bench.js compare --baseline A --target B`
- `node bench.js push --result file.json`
- `node bench.js baseline --scenario X --run-id Y`
- `node bench.js bundle-delta [--limit N]`

2. `bench-monitors` Meteor package (server-side instrumentation)

A new in-app package (`apps/tasks-3.x/packages/bench-monitors/`) injects lightweight, opt-in server instrumentation that emits parseable metrics consumed by the harness:

3. Metric collectors & aggregators (tasks 01–24)

A broad set of metrics, each with a collector (sampling) + aggregator (summarizing) + unit tests + dashboard panel:

- `METEOR_PROFILE=1` build profile (hot nodes) + per-plugin compile time, bundle-size delta

4. Results dashboard (`apps/dashboard/`): design v2

A new Meteor app to visualize runs, built on a Tailwind design system (v2):

- `meteor-benchmarks.us.galaxycloud.app`.

5. Runtime observability & configuration matrix

- `[runtime-info] observer_driver=…/transport=…` on startup; the harness captures these from stderr into each result's `runtime` field, so every pushed run is self-describing.
- `--meteor-version`) or a local checkout (`--meteor-checkout`), mutually exclusive.
- `{changeStreams, oplog} × {sockjs, uws}` comparison on the dashboard.

6. CI workflows

- `benchmark-pr.yml`: run benchmarks on PRs (with hardened `client_payload` handling)
- `benchmark-nightly.yml`: scheduled runs
- `benchmark-runtime-matrix.yml`: the 2×2 observer × transport matrix
- `benchmark-transport.yml`: sockjs vs uws

7. Test suite

~40 `node:test` unit-test files covering every aggregator, the regression detector (incl. zero-baseline / NaN / Infinity edge cases), CLI commands, the meteor-source resolver, runtime-info extraction, and a metric-keys contract test to keep collector output and the dashboard in sync.

8. Cleanup / removals

- `apps/tasks-2.x`); focus is on Meteor 3.x.
- `.log` files under `benchmarks/`.
- `packages/` into the app; pruned obsolete files and tightened `.gitignore`.

Migration notes

- `"type": "module"`).
- `tasks-3.x`.
- `benchmarks/**/*.log` artifacts were intentionally removed; reproduce via `node bench.js run` instead.

How to test
🤖 Generated with Claude Code