perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214
perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214ChaoWao wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the core handshake process in scheduler_cold_path.cpp (for both a2a3 and a5 runtimes) by splitting it into two non-blocking sweeps (Sweep A and Sweep B) to overlap core wakeups and reduce latency. The review feedback highlights a correctness bug where core_exec_states_[i].reg_addr is left uninitialized if validation fails during Sweep A, which prevents emergency_shutdown() from correctly cleaning up initialized registers. To resolve this, reg_addr should be populated immediately in Sweep A, and the redundant assignment in Sweep B can be removed.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthrough
ChangesOut-of-order sweep handshake refactor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 762-781: The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.
In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 766-785: The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 45bbc16f-c2ed-470f-8f92-4b7d16d5d40b
📒 Files selected for processing (2)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
Follow-up: collapse the handshake to a single round-trip (
|
| metric | before (Sweep A/B + RT2) | after (single sweep) | Δ |
|---|---|---|---|
| warm / steady-state (min) | 29.66 µs | 14.10 µs | −52% |
| median | 46.18 µs | 36.15 µs | −22% |
| avg | 54.67 µs | 34.31 µs | −37% |
Correctness
- a2a3 onboard — 9/9 pass:
vector_example,paged_attention(72-core),
async_notify_demo,multi_round_paged_attention(per-run re-handshake),
batch_paged_attention,mixed_example,spmd_basic,spmd_sync_start,
spmd_sync_start_stress. Output tokens identical. - a2a3 sim — 6/6 pass (incl.
multi_round). - Applied symmetrically to a5 (no a5 silicon here; relies on CI
st-onboard-a5/st-sim-a5).
…rrier handshake_all_cores ran two costs serially that can be parallelized: 1. Step 2 blocked on core i (wait aicore_regs_ready, init regs, wait aicore_done) before looking at core i+1, so the 72 AICore cores' wakeup latencies summed. The cores wake and advance independently, so this is now two phase-batched sweeps (poll every outstanding core per pass, service the ready ones): the per-core wakeup waits overlap instead of accumulating. The handshake flags are GM reads, not the nGnRE MMIO reg window, so sweeping is not subject to the serial-LDR constraint that COND polling is. 2. Step 1 raised aicpu_ready inside the per-core loop with a barrier each iteration (71 redundant barriers). The task pointers are now published with one barrier, then aicpu_ready is raised for all cores — one barrier suffices since AICore only relies on "all task stores visible before any aicpu_ready". Measured (qwen3-14B 3.5k decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288): preamble 329us -> 150us/step. Output tokens identical. The residual ~150us is the physical floor: AICore launch (rtKernelLaunch lazy binary load) + NoC cold-wakeup, plus one structurally-required GM round-trip to bind logical block_idx <-> runtime-assigned physical_core_id (the register channel cannot bootstrap that binding — it needs the physical core id the binding is establishing, and has no host-preclearable "not ready" sentinel). Applied symmetrically to a2a3 and a5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
handshake_all_cores ran two AICPU<->AICore round-trips per run: RT1
discovered physical_core_id (aicore_regs_ready), the AICPU then opened
the core's register window and acked (aicpu_regs_ready), and RT2
confirmed the core was up (aicore_done). The second round-trip existed
only because reg-init depended on physical_core_id.
The only information the handshake fundamentally must exchange is each
core's physical_core_id -- the logical block_idx -> physical
register-window binding the AICPU needs to dispatch via
regs[physical_core_id]. Everything else was derivable or an artifact of
the reg-init/id chicken-and-egg.
Collapse to one round-trip:
- AICore reports {physical_core_id, core_type, aicore_done} in a single
write, then waits for its window to open by polling its own
DATA_MAIN_BASE SPR. A kernel launch resets that SPR to 0 (verified on
a2a3 silicon, 72 cores); the AICPU writes IDLE (non-zero) as it opens
FAST_PATH, so a non-zero read means the window is open. The pre-open
read is a plain SPR MOV -- it cannot hang.
- AICPU sweeps once on aicore_done, validates the id, opens the window
(which is itself the ack), and latches reg/type. Sweep A/B, the
aicpu_regs_ready ack, and the aicpu_regs_ready / aicore_regs_ready
Handshake fields are removed.
Applied symmetrically to a2a3 and a5.
Measured (a2a3 onboard, paged_attention, handshake_all_cores only):
avg 54.7 -> 34.3us (-37%); warm steady-state 29.7 -> 14.1us (-52%).
Output tokens identical; a2a3 onboard + sim regression set green.
c56f61d to
6edc7f4
Compare
CI triage —
|
|
| baseline (Sweep A/B + RT2) | this change (single sweep) | Δ | |
|---|---|---|---|
| warm avg | 135.5 µs | 88.2 µs | −47.3 µs (−35%) |
| warm min | 122.0 | 81.8 | |
| warm max | 148.0 | 96.0 | |
| cold (round 0) | 336.1 | 308.7 | −27 µs |
The −47 µs is larger than the isolated handshake_all_cores delta measured on paged_attention (−20 µs) because qwen3 decode runs the full core set: the removed RT2 round-trip + the per-core reg-init ack scale with core count, and preamble is handshake-dominated.
On the earlier "150 µs" in the PR description vs the 135.5 µs baseline here: it is the same 2-layer example (#1088, added 2026-06-25, predates this PR), so this is not an example change. The difference is (a) this PR's branch is now rebased onto ~28 newer main commits, several of which touch the per-run / preamble / arena path (#1242 per-run simpler_run state lifecycle, #1193 prebuilt-arena image caching, #1216/#1227 Runtime split + HostApi refactors), and (b) run-to-run jitter — the warm baseline here spans 122–148 µs (σ ≈ 7 µs), so 150 µs sits at ~2σ, within the same band. The confound-free figure is the toggle above (135.5 → 88.2 µs on identical inputs).
What the
|
Confirming the
|
| phase | baseline (Sweep A/B + RT2) | this change (single sweep) | Δ |
|---|---|---|---|
handshake_all_cores |
88.2 µs | 41.3 µs | −46.9 µs |
preamble (clean, un-instrumented) |
135.5 µs | 88.2 µs | −47.3 µs |
Δhandshake (46.9 µs) ≈ Δpreamble (47.3 µs) — they match to within 0.4 µs, so the entire preamble improvement comes from the handshake.
Decomposition check — the non-handshake part of preamble (thread-pool init barrier + assign_cores_to_threads + task-counter init + payload_per_core_ setup) is preamble − handshake:
- baseline: 135.5 − 88.2 = 47.3 µs
- this change: 88.2 − 41.3 = 46.9 µs
i.e. the non-handshake remainder is unchanged (~47 µs both sides), as expected — this PR only touches the handshake. Handshake share of preamble: 65% → 46% (it shrank; the rest held constant).
(The handshake timer is a temporary LOG_INFO_V9 used only for this measurement — not part of the PR.)
Summary
handshake_all_cores(run once perrun_prepared, on the orchestrator thread,inside the
preambledevice phase) ran two costs serially that can beparallelized. Both AICore cores wake and advance their handshake phases
independently, so blocking on one core before looking at the next made their
latencies sum instead of overlap.
Change 1 — sweep Step 2 instead of per-core blocking
The old loop did, for each core:
while(wait aicore_regs_ready)→ init regs →while(wait aicore_done), fully draining coreibefore touching corei+1.That serializes 72 cores' wakeup waits. Replaced with two phase-batched
sweeps: poll every still-outstanding core per pass and service whichever are
ready. The per-core wakeup waits now overlap (≈ slowest single core, not the
sum). The handshake flags are GM reads, not the
nGnREMMIO registerwindow, so sweeping is not subject to the serial-LDR constraint that
RegId::CONDpolling is.
Change 2 — batch the Step 1 release barrier
Step 1 raised
aicpu_readyinside the per-core loop with anOUT_OF_ORDER_STORE_BARRIER()each iteration (71 redundant barriers). Thetask pointers are now published with one barrier, then
aicpu_readyis raisedfor all cores. One barrier suffices — AICore only relies on "all
taskstoresglobally visible before any
aicpu_readystore".Applied symmetrically to a2a3 and a5.
Measured —
preambledevice phase onlyqwen3-14B, 3.5k-context decode, a2a3 onboard,
PTO2_RING_TASK_WINDOW=524288,averaged over the 19 decode steps:
preambleNet −179 µs/step (−54%). Output token IDs identical to baseline.
Why it stops at ~150 µs (the residual is a physical / protocol floor)
The remaining
preambleis dominated by costs no AICPU-side code change canremove:
rtKernelLaunchWithHandleV2lazilyloads the AICore kernel binary onto the device on first launch, and each core
must be woken across the NoC before it reaches the handshake point. The sweep
already overlaps these waits to the slowest single core; that core's wakeup
is a hardware floor.
block_idx↔runtime-assigned
physical_core_id. Three follow-on ideas to move this offGM were each ruled out by verified hardware constraints:
cannot write
DATA_MAIN_BASE/FAST_PATH_ENABLE— the SPR write isrejected by the CCEC backend and an MMIO STR to the SPR window hangs the
chip (
.claude/rules/ascend.md,docs/hardware/mmio-performance.md). Reginit must stay on the AICPU.
physical_core_id/ signals over COND instead of GM: to poll acore's COND the AICPU needs that core's
reg_addr = regs[physical_core_id]— which is exactly the binding the handshake is establishing. The register
channel cannot bootstrap the binding it depends on.
distinguish "AICPU has not initialized my regs yet" from "done", which
needs a host-preclearable "not ready" sentinel. GM flags have one (host
zeroes
workers[]each run); SPRs (DMB/COND) do not — the host cannot writethem, so a stale value would be misread. The init-ack must stay a GM flag.
So
preamble's floor is: AICore binary load + NoC wakeup + one GM round-trip forthe logical↔physical binding. The accumulation (sweep) and the redundant
barriers (batch) — the parts that were software overhead — are removed here.
Testing
paths:
vector_example,async_notify_demo,sdma_async_completion_demo,paged_attention,multi_round_paged_attention,batch_paged_attention— all pass.identical to a2a3).