perf(runtime): overlap AICore handshake wakeups; batch the release barrier by ChaoWao · Pull Request #1214 · hw-native-sys/simpler

ChaoWao · 2026-06-30T11:16:06Z

Summary

handshake_all_cores (run once per run_prepared, on the orchestrator thread,
inside the preamble device phase) ran two costs serially that can be
parallelized. Both AICore cores wake and advance their handshake phases
independently, so blocking on one core before looking at the next made their
latencies sum instead of overlap.

Change 1 — sweep Step 2 instead of per-core blocking

The old loop did, for each core: while(wait aicore_regs_ready) → init regs →
while(wait aicore_done), fully draining core i before touching core i+1.
That serializes 72 cores' wakeup waits. Replaced with two phase-batched
sweeps: poll every still-outstanding core per pass and service whichever are
ready. The per-core wakeup waits now overlap (≈ slowest single core, not the
sum). The handshake flags are GM reads, not the nGnRE MMIO register
window, so sweeping is not subject to the serial-LDR constraint that RegId::COND
polling is.

Change 2 — batch the Step 1 release barrier

Step 1 raised aicpu_ready inside the per-core loop with an
OUT_OF_ORDER_STORE_BARRIER() each iteration (71 redundant barriers). The
task pointers are now published with one barrier, then aicpu_ready is raised
for all cores. One barrier suffices — AICore only relies on "all task stores
globally visible before any aicpu_ready store".

Applied symmetrically to a2a3 and a5.

Measured — `preamble` device phase only

qwen3-14B, 3.5k-context decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288,
averaged over the 19 decode steps:

`preamble`	value
before (per-core blocking + per-iter barrier)	329 µs
after sweep Step 2	213 µs
after sweep + batched Step 1 barrier	150 µs

Net −179 µs/step (−54%). Output token IDs identical to baseline.

Why it stops at ~150 µs (the residual is a physical / protocol floor)

The remaining preamble is dominated by costs no AICPU-side code change can
remove:

AICore launch + NoC cold-wakeup — rtKernelLaunchWithHandleV2 lazily
loads the AICore kernel binary onto the device on first launch, and each core
must be woken across the NoC before it reaches the handshake point. The sweep
already overlaps these waits to the slowest single core; that core's wakeup
is a hardware floor.
One structurally-required GM round-trip to bind logical block_idx ↔
runtime-assigned physical_core_id. Three follow-on ideas to move this off
GM were each ruled out by verified hardware constraints:
- Core initializes its own dispatch regs (to drop the round-trip): AICore
  cannot write DATA_MAIN_BASE / FAST_PATH_ENABLE — the SPR write is
  rejected by the CCEC backend and an MMIO STR to the SPR window hangs the
  chip (.claude/rules/ascend.md, docs/hardware/mmio-performance.md). Reg
  init must stay on the AICPU.
- Report physical_core_id / signals over COND instead of GM: to poll a
  core's COND the AICPU needs that core's reg_addr = regs[physical_core_id]
  — which is exactly the binding the handshake is establishing. The register
  channel cannot bootstrap the binding it depends on.
- Acknowledge reg-init over DMB instead of a GM flag: the core must
  distinguish "AICPU has not initialized my regs yet" from "done", which
  needs a host-preclearable "not ready" sentinel. GM flags have one (host
  zeroes workers[] each run); SPRs (DMB/COND) do not — the host cannot write
  them, so a stale value would be misread. The init-ack must stay a GM flag.

So preamble's floor is: AICore binary load + NoC wakeup + one GM round-trip for
the logical↔physical binding. The accumulation (sweep) and the redundant
barriers (batch) — the parts that were software overhead — are removed here.

Testing

a2a3 onboard, output tokens identical to baseline across the run.
a2a3 onboard correctness across the handshake → dispatch → completion
paths: vector_example, async_notify_demo, sdma_async_completion_demo,
paged_attention, multi_round_paged_attention, batch_paged_attention — all pass.
a5 onboard (no a5 silicon on this box — relies on CI; change is structurally
identical to a2a3).

gemini-code-assist

Code Review

This pull request refactors the core handshake process in scheduler_cold_path.cpp (for both a2a3 and a5 runtimes) by splitting it into two non-blocking sweeps (Sweep A and Sweep B) to overlap core wakeups and reduce latency. The review feedback highlights a correctness bug where core_exec_states_[i].reg_addr is left uninitialized if validation fails during Sweep A, which prevents emergency_shutdown() from correctly cleaning up initialized registers. To resolve this, reg_addr should be populated immediately in Sweep A, and the redundant assignment in Sweep B can be removed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai · 2026-06-30T11:18:41Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cdfd98b1-269d-492f-afd8-8808c340f3a0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

SchedulerContext::handshake_all_cores in both the a2a3 and a5 variants is refactored from a serial per-core blocking loop into a batched two-phase pattern: Step 1 publishes all payload pointers with a single barrier then asserts aicpu_ready for all cores with a second barrier; Step 2 splits into two out-of-order polling sweeps (Sweep A: regs_ready/validate/init/ack; Sweep B: done/latch state).

Changes

Out-of-order sweep handshake refactor

Layer / File(s)	Summary
Batched payload publication (Step 1) `src/a2a3/.../scheduler_cold_path.cpp`, `src/a5/.../scheduler_cold_path.cpp`	Both files change Step 1 to write all per-core task pointers, issue one store barrier, set `aicpu_ready=1` for all cores, then issue a second barrier—removing the previous per-core barrier between task store and `aicpu_ready`.
Sweep A: regs_ready polling, validation, and ack `src/a2a3/.../scheduler_cold_path.cpp`, `src/a5/.../scheduler_cold_path.cpp`	Serial `regs_ready→init→ack` block replaced with a sweep that polls `aicore_regs_ready` across all cores in repeated passes, validates `physical_core_id`, calls `platform_init_aicore_regs`, sets `aicpu_regs_ready=1` with a barrier per core, tracks completion via `regs_phase_done[]`, and defers `emergency_shutdown`/`return -1` until after the full sweep if `handshake_failed` is set.
Sweep B: done polling and state latch `src/a2a3/.../scheduler_cold_path.cpp`, `src/a5/.../scheduler_cold_path.cpp`	Second sweep polls `aicore_done` across all cores and latches `reg_addr`, `cond_ptr`, `worker_id`, `core_type` into `core_exec_states_[i]`, populates `aic_worker_ids_`/`aiv_worker_ids_` by `core_type`, tracked with `done_phase_done[]`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 Two sweeps now dance where one loop stood,
barriers batch as they always should.
Regs_ready, done—no more waiting in line,
out-of-order polls let all cores shine.
The rabbit hops fast, no serial delay! 🚀

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main performance change: overlapping AICore wakeups and batching the release barrier.
Description check	✅ Passed	The description is directly related to the changes and explains the handshake sweep and barrier batching in detail.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 762-781: The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 766-785: The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45bbc16f-c2ed-470f-8f92-4b7d16d5d40b

📥 Commits

Reviewing files that changed from the base of the PR and between 11d03d9 and 445329d.

📒 Files selected for processing (2)

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

ChaoZheng109 · 2026-07-02T09:31:57Z

Follow-up: collapse the handshake to a single round-trip (`c56f61d5`)

Building on the sweep/barrier work here, this commit removes the second
AICPU↔AICore round-trip entirely rather than just overlapping it.

What & why

The old handshake ran two round-trips per run:

RT1 — AICore reports physical_core_id (aicore_regs_ready)
AICPU opens the core's register window and acks (aicpu_regs_ready)
RT2 — AICore confirms up (aicore_done)

RT2 existed only because reg-init needed physical_core_id first. But the
only thing the handshake fundamentally must exchange is each core's
physical_core_id (the block_idx → regs[physical_core_id] binding the
AICPU dispatches through). So:

AICore now reports {physical_core_id, core_type, aicore_done} in one
write, then waits for its window to open by polling its own
DATA_MAIN_BASE SPR.
AICPU sweeps once on aicore_done, opens the window (which is the
ack), and latches. Sweep A/B, the aicpu_regs_ready ack, and the two
*_regs_ready Handshake fields are gone.

Hardware facts validated on-silicon (a2a3, probe on 72 cores)

A kernel launch resets DATA_MAIN_BASE to 0 (does not persist the
prior deinit's IDLE) — so 0 is an unambiguous "window not open yet"
sentinel, and the AICPU's IDLE write (non-zero) is the open signal.
The pre-open SPR read is a plain MOV DATA_MAIN_BASE — it does not
hang (72/72 cores, all runs completed).

Measured — `handshake_all_cores` only (a2a3 onboard, `paged_attention`, 20 rounds, n=80 real handshakes)

metric	before (Sweep A/B + RT2)	after (single sweep)	Δ
warm / steady-state (min)	29.66 µs	14.10 µs	−52%
median	46.18 µs	36.15 µs	−22%
avg	54.67 µs	34.31 µs	−37%

Correctness

a2a3 onboard — 9/9 pass: vector_example, paged_attention (72-core),
async_notify_demo, multi_round_paged_attention (per-run re-handshake),
batch_paged_attention, mixed_example, spmd_basic, spmd_sync_start,
spmd_sync_start_stress. Output tokens identical.
a2a3 sim — 6/6 pass (incl. multi_round).
Applied symmetrically to a5 (no a5 silicon here; relies on CI
st-onboard-a5 / st-sim-a5).

…rrier handshake_all_cores ran two costs serially that can be parallelized: 1. Step 2 blocked on core i (wait aicore_regs_ready, init regs, wait aicore_done) before looking at core i+1, so the 72 AICore cores' wakeup latencies summed. The cores wake and advance independently, so this is now two phase-batched sweeps (poll every outstanding core per pass, service the ready ones): the per-core wakeup waits overlap instead of accumulating. The handshake flags are GM reads, not the nGnRE MMIO reg window, so sweeping is not subject to the serial-LDR constraint that COND polling is. 2. Step 1 raised aicpu_ready inside the per-core loop with a barrier each iteration (71 redundant barriers). The task pointers are now published with one barrier, then aicpu_ready is raised for all cores — one barrier suffices since AICore only relies on "all task stores visible before any aicpu_ready". Measured (qwen3-14B 3.5k decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288): preamble 329us -> 150us/step. Output tokens identical. The residual ~150us is the physical floor: AICore launch (rtKernelLaunch lazy binary load) + NoC cold-wakeup, plus one structurally-required GM round-trip to bind logical block_idx <-> runtime-assigned physical_core_id (the register channel cannot bootstrap that binding — it needs the physical core id the binding is establishing, and has no host-preclearable "not ready" sentinel). Applied symmetrically to a2a3 and a5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

handshake_all_cores ran two AICPU<->AICore round-trips per run: RT1 discovered physical_core_id (aicore_regs_ready), the AICPU then opened the core's register window and acked (aicpu_regs_ready), and RT2 confirmed the core was up (aicore_done). The second round-trip existed only because reg-init depended on physical_core_id. The only information the handshake fundamentally must exchange is each core's physical_core_id -- the logical block_idx -> physical register-window binding the AICPU needs to dispatch via regs[physical_core_id]. Everything else was derivable or an artifact of the reg-init/id chicken-and-egg. Collapse to one round-trip: - AICore reports {physical_core_id, core_type, aicore_done} in a single write, then waits for its window to open by polling its own DATA_MAIN_BASE SPR. A kernel launch resets that SPR to 0 (verified on a2a3 silicon, 72 cores); the AICPU writes IDLE (non-zero) as it opens FAST_PATH, so a non-zero read means the window is open. The pre-open read is a plain SPR MOV -- it cannot hang. - AICPU sweeps once on aicore_done, validates the id, opens the window (which is itself the ack), and latches reg/type. Sweep A/B, the aicpu_regs_ready ack, and the aicpu_regs_ready / aicore_regs_ready Handshake fields are removed. Applied symmetrically to a2a3 and a5. Measured (a2a3 onboard, paged_attention, handshake_all_cores only): avg 54.7 -> 34.3us (-37%); warm steady-state 29.7 -> 14.1us (-52%). Output tokens identical; a2a3 onboard + sim regression set green.

ChaoZheng109 · 2026-07-03T01:00:54Z

CI triage — `st-sim-a5` timeout is pre-existing, unrelated to this change

The only red check is st-sim-a5 (macos fail, ubuntu cancelled by matrix fail-fast). Root cause from the macos log:

13:37:16  spmd_sync_start_stress starts
13:43:45  [gw0][93%] PASSED spmd_sync_start_stress   (~6m29s)
13:43:45  [pytest] TIMEOUT: session exceeded 600s (10min) limit  → exit 124

All tests PASS — this is a session wall-clock overrun, not a correctness failure. It is not introduced by the handshake change:

st-onboard-a5 and st-onboard-a2a3 both pass (the a5 change is functionally correct on silicon).
st-sim-a2a3 passes; only a5 sim overruns.
The a5 sim suite grew yesterday — test(dfx): mirror a2a3 DFX scene tests to a5 #1246 (test(dfx): mirror a2a3 DFX scene tests to a5, merged 2026-07-02) added 5 DFX scene tests (args_dump, l2_swimlane ×2, pmu, scope_stats). Combined with the ~6.5-min spmd_sync_start_stress tent-pole, the session now sits right at the 600s budget.
Under this change spmd_sync_start_stress ran faster than on the base (6m29s vs ~7m46s previously), so the handshake work does not slow it down.

This is a repo-level a5-sim CI-budget issue (suite size vs the 600s session cap), best addressed separately (trim/shard the a5 sim stress test or raise the session budget) rather than in this handshake PR. Rerunning once to confirm it is a hard budget overrun and not a one-off slow runner.

ChaoZheng109 · 2026-07-03T01:38:33Z

`preamble` measurement (STRACE) — clean toggle on the current tree

Measured the device_wall.preamble device phase via STRACE (simpler_setup.tools.strace_timing) on qwen3-14B 2-layer decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288, --rounds 20, warm average (excluding the round-0 cold launch). Only this commit was toggled on the current branch — same binary, same example, same box:

	baseline (Sweep A/B + RT2)	this change (single sweep)	Δ
warm avg	135.5 µs	88.2 µs	−47.3 µs (−35%)
warm min	122.0	81.8
warm max	148.0	96.0
cold (round 0)	336.1	308.7	−27 µs

The −47 µs is larger than the isolated handshake_all_cores delta measured on paged_attention (−20 µs) because qwen3 decode runs the full core set: the removed RT2 round-trip + the per-core reg-init ack scale with core count, and preamble is handshake-dominated.

On the earlier "150 µs" in the PR description vs the 135.5 µs baseline here: it is the same 2-layer example (#1088, added 2026-06-25, predates this PR), so this is not an example change. The difference is (a) this PR's branch is now rebased onto ~28 newer main commits, several of which touch the per-run / preamble / arena path (#1242 per-run simpler_run state lifecycle, #1193 prebuilt-arena image caching, #1216/#1227 Runtime split + HostApi refactors), and (b) run-to-run jitter — the warm baseline here spans 122–148 µs (σ ≈ 7 µs), so 150 µs sits at ~2σ, within the same band. The confound-free figure is the toggle above (135.5 → 88.2 µs on identical inputs).

ChaoZheng109 · 2026-07-03T01:48:45Z

What the `preamble` phase actually covers (it is not only the handshake)

preamble brackets the entire per-run AICPU init() on the device, not just the handshake. In aicpu_execute() (aicpu_executor.cpp:878):

{
    AicpuPhaseScope preamble(AicpuPhase::Preamble);
    g_aicpu_executor.init(runtime);   // everything below
    while (!init_done_) { ... }        // the other AICPU threads spin here
}

init() → SchedulerContext::init() runs, in order:

AICPU thread-pool init barrier — the first thread runs init; the other aicpu_thread_num - 1 threads spin until init_done_.
handshake_all_cores() — the AICPU↔AICore discovery handshake (what this PR changes). The AICPU blocks here until every core wakes across the NoC and reports, opens each core's register window, and (pre-PR) ran the second ack round-trip. The structural floor lives inside this wait: AICore kernel binary load on first launch + per-core NoC cold-wakeup.
assign_cores_to_threads() — cluster-aligned round-robin assignment of the discovered cores to scheduler threads.
Profiling-subsystem init — l2_swimlane_aicpu_init / pmu_aicpu_init / dump_args_init / dep_gen_aicpu_init, only when the corresponding profiling flags are set (OFF in the numbers above).
Task-counter init — read the total task count from PTO2 shared memory.
Per-core payload setup — payload_per_core_ reset + SPMD sub_block_id wiring.

So preamble = thread-barrier + handshake + core-assignment + (profiling init) + task-counters + payload setup. The handshake is the dominant, blocking component — it is the part that stalls on the NoC wakeup + protocol round-trips — so this PR's handshake change moves preamble by ~the handshake delta, while steps 1/3/5/6 are fast and unchanged.

If it's useful I can re-instrument handshake_all_cores on the qwen3 workload to give the exact handshake-vs-rest split inside the 135.5 → 88.2 µs (the way the isolated handshake_all_cores was measured on paged_attention: 54.7 → 34.3 µs).

ChaoZheng109 · 2026-07-03T01:59:33Z

Confirming the `preamble` win is exactly the handshake win

To verify that the preamble reduction is fully explained by the handshake change (and not by some other phase), I instrumented handshake_all_cores() with a timer and measured it alongside preamble on the same qwen3-14B 2-layer decode workload (a2a3 onboard, --rounds 20, warm avg n=19):

phase	baseline (Sweep A/B + RT2)	this change (single sweep)	Δ
`handshake_all_cores`	88.2 µs	41.3 µs	−46.9 µs
`preamble` (clean, un-instrumented)	135.5 µs	88.2 µs	−47.3 µs

Δhandshake (46.9 µs) ≈ Δpreamble (47.3 µs) — they match to within 0.4 µs, so the entire preamble improvement comes from the handshake.

Decomposition check — the non-handshake part of preamble (thread-pool init barrier + assign_cores_to_threads + task-counter init + payload_per_core_ setup) is preamble − handshake:

baseline: 135.5 − 88.2 = 47.3 µs
this change: 88.2 − 41.3 = 46.9 µs

i.e. the non-handshake remainder is unchanged (~47 µs both sides), as expected — this PR only touches the handshake. Handshake share of preamble: 65% → 46% (it shrank; the rest held constant).

(The handshake timer is a temporary LOG_INFO_V9 used only for this measurement — not part of the PR.)

gemini-code-assist Bot reviewed Jun 30, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp Outdated

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp Outdated

ChaoWao and others added 2 commits July 2, 2026 17:32

ChaoZheng109 force-pushed the handshake-overlap branch from c56f61d to 6edc7f4 Compare July 2, 2026 09:37

Uh oh!

Conversation

ChaoWao commented Jun 30, 2026

Summary

Change 1 — sweep Step 2 instead of per-core blocking

Change 2 — batch the Step 1 release barrier

Measured — preamble device phase only

Why it stops at ~150 µs (the residual is a physical / protocol floor)

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ChaoZheng109 commented Jul 2, 2026

Follow-up: collapse the handshake to a single round-trip (c56f61d5)

What & why

Hardware facts validated on-silicon (a2a3, probe on 72 cores)

Measured — handshake_all_cores only (a2a3 onboard, paged_attention, 20 rounds, n=80 real handshakes)

Correctness

Uh oh!

ChaoZheng109 commented Jul 3, 2026

CI triage — st-sim-a5 timeout is pre-existing, unrelated to this change

Uh oh!

ChaoZheng109 commented Jul 3, 2026

preamble measurement (STRACE) — clean toggle on the current tree

Uh oh!

ChaoZheng109 commented Jul 3, 2026

What the preamble phase actually covers (it is not only the handshake)

Uh oh!

ChaoZheng109 commented Jul 3, 2026

Confirming the preamble win is exactly the handshake win

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Measured — `preamble` device phase only

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading

Follow-up: collapse the handshake to a single round-trip (`c56f61d5`)

Measured — `handshake_all_cores` only (a2a3 onboard, `paged_attention`, 20 rounds, n=80 real handshakes)

CI triage — `st-sim-a5` timeout is pre-existing, unrelated to this change

`preamble` measurement (STRACE) — clean toggle on the current tree

What the `preamble` phase actually covers (it is not only the handshake)

Confirming the `preamble` win is exactly the handshake win