Skip to content

perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214

Open
ChaoWao wants to merge 2 commits into
hw-native-sys:mainfrom
ChaoWao:handshake-overlap
Open

perf(runtime): overlap AICore handshake wakeups; batch the release barrier#1214
ChaoWao wants to merge 2 commits into
hw-native-sys:mainfrom
ChaoWao:handshake-overlap

Conversation

@ChaoWao

@ChaoWao ChaoWao commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

handshake_all_cores (run once per run_prepared, on the orchestrator thread,
inside the preamble device phase) ran two costs serially that can be
parallelized. Both AICore cores wake and advance their handshake phases
independently, so blocking on one core before looking at the next made their
latencies sum instead of overlap.

Change 1 — sweep Step 2 instead of per-core blocking

The old loop did, for each core: while(wait aicore_regs_ready) → init regs →
while(wait aicore_done), fully draining core i before touching core i+1.
That serializes 72 cores' wakeup waits. Replaced with two phase-batched
sweeps
: poll every still-outstanding core per pass and service whichever are
ready. The per-core wakeup waits now overlap (≈ slowest single core, not the
sum). The handshake flags are GM reads, not the nGnRE MMIO register
window, so sweeping is not subject to the serial-LDR constraint that RegId::COND
polling is.

Change 2 — batch the Step 1 release barrier

Step 1 raised aicpu_ready inside the per-core loop with an
OUT_OF_ORDER_STORE_BARRIER() each iteration (71 redundant barriers). The
task pointers are now published with one barrier, then aicpu_ready is raised
for all cores. One barrier suffices — AICore only relies on "all task stores
globally visible before any aicpu_ready store".

Applied symmetrically to a2a3 and a5.

Measured — preamble device phase only

qwen3-14B, 3.5k-context decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288,
averaged over the 19 decode steps:

preamble value
before (per-core blocking + per-iter barrier) 329 µs
after sweep Step 2 213 µs
after sweep + batched Step 1 barrier 150 µs

Net −179 µs/step (−54%). Output token IDs identical to baseline.

Why it stops at ~150 µs (the residual is a physical / protocol floor)

The remaining preamble is dominated by costs no AICPU-side code change can
remove:

  1. AICore launch + NoC cold-wakeuprtKernelLaunchWithHandleV2 lazily
    loads the AICore kernel binary onto the device on first launch, and each core
    must be woken across the NoC before it reaches the handshake point. The sweep
    already overlaps these waits to the slowest single core; that core's wakeup
    is a hardware floor.
  2. One structurally-required GM round-trip to bind logical block_idx
    runtime-assigned physical_core_id. Three follow-on ideas to move this off
    GM were each ruled out by verified hardware constraints:
    • Core initializes its own dispatch regs (to drop the round-trip): AICore
      cannot write DATA_MAIN_BASE / FAST_PATH_ENABLE — the SPR write is
      rejected by the CCEC backend and an MMIO STR to the SPR window hangs the
      chip (.claude/rules/ascend.md, docs/hardware/mmio-performance.md). Reg
      init must stay on the AICPU.
    • Report physical_core_id / signals over COND instead of GM: to poll a
      core's COND the AICPU needs that core's reg_addr = regs[physical_core_id]
      — which is exactly the binding the handshake is establishing. The register
      channel cannot bootstrap the binding it depends on.
    • Acknowledge reg-init over DMB instead of a GM flag: the core must
      distinguish "AICPU has not initialized my regs yet" from "done", which
      needs a host-preclearable "not ready" sentinel. GM flags have one (host
      zeroes workers[] each run); SPRs (DMB/COND) do not — the host cannot write
      them, so a stale value would be misread. The init-ack must stay a GM flag.

So preamble's floor is: AICore binary load + NoC wakeup + one GM round-trip for
the logical↔physical binding. The accumulation (sweep) and the redundant
barriers (batch) — the parts that were software overhead — are removed here.

Testing

  • a2a3 onboard, output tokens identical to baseline across the run.
  • a2a3 onboard correctness across the handshake → dispatch → completion
    paths: vector_example, async_notify_demo, sdma_async_completion_demo,
    paged_attention, multi_round_paged_attention, batch_paged_attention — all pass.
  • a5 onboard (no a5 silicon on this box — relies on CI; change is structurally
    identical to a2a3).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the core handshake process in scheduler_cold_path.cpp (for both a2a3 and a5 runtimes) by splitting it into two non-blocking sweeps (Sweep A and Sweep B) to overlap core wakeups and reduce latency. The review feedback highlights a correctness bug where core_exec_states_[i].reg_addr is left uninitialized if validation fails during Sweep A, which prevents emergency_shutdown() from correctly cleaning up initialized registers. To resolve this, reg_addr should be populated immediately in Sweep A, and the redundant assignment in Sweep B can be removed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp Outdated
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cdfd98b1-269d-492f-afd8-8808c340f3a0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

SchedulerContext::handshake_all_cores in both the a2a3 and a5 variants is refactored from a serial per-core blocking loop into a batched two-phase pattern: Step 1 publishes all payload pointers with a single barrier then asserts aicpu_ready for all cores with a second barrier; Step 2 splits into two out-of-order polling sweeps (Sweep A: regs_ready/validate/init/ack; Sweep B: done/latch state).

Changes

Out-of-order sweep handshake refactor

Layer / File(s) Summary
Batched payload publication (Step 1)
src/a2a3/.../scheduler_cold_path.cpp, src/a5/.../scheduler_cold_path.cpp
Both files change Step 1 to write all per-core task pointers, issue one store barrier, set aicpu_ready=1 for all cores, then issue a second barrier—removing the previous per-core barrier between task store and aicpu_ready.
Sweep A: regs_ready polling, validation, and ack
src/a2a3/.../scheduler_cold_path.cpp, src/a5/.../scheduler_cold_path.cpp
Serial regs_ready→init→ack block replaced with a sweep that polls aicore_regs_ready across all cores in repeated passes, validates physical_core_id, calls platform_init_aicore_regs, sets aicpu_regs_ready=1 with a barrier per core, tracks completion via regs_phase_done[], and defers emergency_shutdown/return -1 until after the full sweep if handshake_failed is set.
Sweep B: done polling and state latch
src/a2a3/.../scheduler_cold_path.cpp, src/a5/.../scheduler_cold_path.cpp
Second sweep polls aicore_done across all cores and latches reg_addr, cond_ptr, worker_id, core_type into core_exec_states_[i], populates aic_worker_ids_/aiv_worker_ids_ by core_type, tracked with done_phase_done[].

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 Two sweeps now dance where one loop stood,
barriers batch as they always should.
Regs_ready, done—no more waiting in line,
out-of-order polls let all cores shine.
The rabbit hops fast, no serial delay! 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main performance change: overlapping AICore wakeups and batching the release barrier.
Description check ✅ Passed The description is directly related to the changes and explains the handshake sweep and barrier batching in detail.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 762-781: The failure path in the scheduler cold-path
initialization leaves some initialized register addresses untracked because
`reg_addr` is only stored in Sweep B, so `emergency_shutdown()` can miss regs
set up before an invalid `physical_core_id` triggers `handshake_failed`. Update
the `scheduler_cold_path.cpp` flow around the `reg_addr` initialization and
`emergency_shutdown()` handling so every successfully initialized `reg_addr` is
persisted immediately in `core_exec_states_[i].reg_addr` before any possible
early return, ensuring shutdown can deinitialize all regs consistently.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 766-785: The reg address for each initialized core is being
recorded too late, so the failure path in `scheduler_cold_path.cpp` can miss or
reuse stale state after `platform_init_aicore_regs(reg_addr)` has already run.
Move the assignment of the initialized reg address into the same Sweep A path
where `platform_init_aicore_regs` and `hank->aicpu_regs_ready` are set, using
`core_exec_states_[i].reg_addr` so `emergency_shutdown` and the later deinit
logic can see the correct value. Keep the update paired with the existing core
state writes in the loop that handles `physical_core_id`, `reg_addr_of`, and
`regs_phase_done`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45bbc16f-c2ed-470f-8f92-4b7d16d5d40b

📥 Commits

Reviewing files that changed from the base of the PR and between 11d03d9 and 445329d.

📒 Files selected for processing (2)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp Outdated
@ChaoZheng109

Copy link
Copy Markdown
Collaborator

Follow-up: collapse the handshake to a single round-trip (c56f61d5)

Building on the sweep/barrier work here, this commit removes the second
AICPU↔AICore round-trip entirely rather than just overlapping it.

What & why

The old handshake ran two round-trips per run:

  • RT1 — AICore reports physical_core_id (aicore_regs_ready)
  • AICPU opens the core's register window and acks (aicpu_regs_ready)
  • RT2 — AICore confirms up (aicore_done)

RT2 existed only because reg-init needed physical_core_id first. But the
only thing the handshake fundamentally must exchange is each core's
physical_core_id (the block_idx → regs[physical_core_id] binding the
AICPU dispatches through). So:

  • AICore now reports {physical_core_id, core_type, aicore_done} in one
    write, then waits for its window to open by polling its own
    DATA_MAIN_BASE SPR.
  • AICPU sweeps once on aicore_done, opens the window (which is the
    ack), and latches. Sweep A/B, the aicpu_regs_ready ack, and the two
    *_regs_ready Handshake fields are gone.

Hardware facts validated on-silicon (a2a3, probe on 72 cores)

  • A kernel launch resets DATA_MAIN_BASE to 0 (does not persist the
    prior deinit's IDLE) — so 0 is an unambiguous "window not open yet"
    sentinel, and the AICPU's IDLE write (non-zero) is the open signal.
  • The pre-open SPR read is a plain MOV DATA_MAIN_BASEit does not
    hang
    (72/72 cores, all runs completed).

Measured — handshake_all_cores only (a2a3 onboard, paged_attention, 20 rounds, n=80 real handshakes)

metric before (Sweep A/B + RT2) after (single sweep) Δ
warm / steady-state (min) 29.66 µs 14.10 µs −52%
median 46.18 µs 36.15 µs −22%
avg 54.67 µs 34.31 µs −37%

Correctness

  • a2a3 onboard — 9/9 pass: vector_example, paged_attention (72-core),
    async_notify_demo, multi_round_paged_attention (per-run re-handshake),
    batch_paged_attention, mixed_example, spmd_basic, spmd_sync_start,
    spmd_sync_start_stress. Output tokens identical.
  • a2a3 sim — 6/6 pass (incl. multi_round).
  • Applied symmetrically to a5 (no a5 silicon here; relies on CI
    st-onboard-a5 / st-sim-a5).

ChaoWao and others added 2 commits July 2, 2026 17:32
…rrier

handshake_all_cores ran two costs serially that can be parallelized:

1. Step 2 blocked on core i (wait aicore_regs_ready, init regs, wait
   aicore_done) before looking at core i+1, so the 72 AICore cores' wakeup
   latencies summed. The cores wake and advance independently, so this is now
   two phase-batched sweeps (poll every outstanding core per pass, service the
   ready ones): the per-core wakeup waits overlap instead of accumulating. The
   handshake flags are GM reads, not the nGnRE MMIO reg window, so sweeping is
   not subject to the serial-LDR constraint that COND polling is.

2. Step 1 raised aicpu_ready inside the per-core loop with a barrier each
   iteration (71 redundant barriers). The task pointers are now published with
   one barrier, then aicpu_ready is raised for all cores — one barrier suffices
   since AICore only relies on "all task stores visible before any aicpu_ready".

Measured (qwen3-14B 3.5k decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288):
preamble 329us -> 150us/step. Output tokens identical.

The residual ~150us is the physical floor: AICore launch (rtKernelLaunch lazy
binary load) + NoC cold-wakeup, plus one structurally-required GM round-trip to
bind logical block_idx <-> runtime-assigned physical_core_id (the register
channel cannot bootstrap that binding — it needs the physical core id the
binding is establishing, and has no host-preclearable "not ready" sentinel).
Applied symmetrically to a2a3 and a5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
handshake_all_cores ran two AICPU<->AICore round-trips per run: RT1
discovered physical_core_id (aicore_regs_ready), the AICPU then opened
the core's register window and acked (aicpu_regs_ready), and RT2
confirmed the core was up (aicore_done). The second round-trip existed
only because reg-init depended on physical_core_id.

The only information the handshake fundamentally must exchange is each
core's physical_core_id -- the logical block_idx -> physical
register-window binding the AICPU needs to dispatch via
regs[physical_core_id]. Everything else was derivable or an artifact of
the reg-init/id chicken-and-egg.

Collapse to one round-trip:
- AICore reports {physical_core_id, core_type, aicore_done} in a single
  write, then waits for its window to open by polling its own
  DATA_MAIN_BASE SPR. A kernel launch resets that SPR to 0 (verified on
  a2a3 silicon, 72 cores); the AICPU writes IDLE (non-zero) as it opens
  FAST_PATH, so a non-zero read means the window is open. The pre-open
  read is a plain SPR MOV -- it cannot hang.
- AICPU sweeps once on aicore_done, validates the id, opens the window
  (which is itself the ack), and latches reg/type. Sweep A/B, the
  aicpu_regs_ready ack, and the aicpu_regs_ready / aicore_regs_ready
  Handshake fields are removed.

Applied symmetrically to a2a3 and a5.

Measured (a2a3 onboard, paged_attention, handshake_all_cores only):
avg 54.7 -> 34.3us (-37%); warm steady-state 29.7 -> 14.1us (-52%).
Output tokens identical; a2a3 onboard + sim regression set green.
@ChaoZheng109

Copy link
Copy Markdown
Collaborator

CI triage — st-sim-a5 timeout is pre-existing, unrelated to this change

The only red check is st-sim-a5 (macos fail, ubuntu cancelled by matrix fail-fast). Root cause from the macos log:

13:37:16  spmd_sync_start_stress starts
13:43:45  [gw0][93%] PASSED spmd_sync_start_stress   (~6m29s)
13:43:45  [pytest] TIMEOUT: session exceeded 600s (10min) limit  → exit 124

All tests PASS — this is a session wall-clock overrun, not a correctness failure. It is not introduced by the handshake change:

  • st-onboard-a5 and st-onboard-a2a3 both pass (the a5 change is functionally correct on silicon).
  • st-sim-a2a3 passes; only a5 sim overruns.
  • The a5 sim suite grew yesterday — test(dfx): mirror a2a3 DFX scene tests to a5 #1246 (test(dfx): mirror a2a3 DFX scene tests to a5, merged 2026-07-02) added 5 DFX scene tests (args_dump, l2_swimlane ×2, pmu, scope_stats). Combined with the ~6.5-min spmd_sync_start_stress tent-pole, the session now sits right at the 600s budget.
  • Under this change spmd_sync_start_stress ran faster than on the base (6m29s vs ~7m46s previously), so the handshake work does not slow it down.

This is a repo-level a5-sim CI-budget issue (suite size vs the 600s session cap), best addressed separately (trim/shard the a5 sim stress test or raise the session budget) rather than in this handshake PR. Rerunning once to confirm it is a hard budget overrun and not a one-off slow runner.

@ChaoZheng109

Copy link
Copy Markdown
Collaborator

preamble measurement (STRACE) — clean toggle on the current tree

Measured the device_wall.preamble device phase via STRACE (simpler_setup.tools.strace_timing) on qwen3-14B 2-layer decode, a2a3 onboard, PTO2_RING_TASK_WINDOW=524288, --rounds 20, warm average (excluding the round-0 cold launch). Only this commit was toggled on the current branch — same binary, same example, same box:

baseline (Sweep A/B + RT2) this change (single sweep) Δ
warm avg 135.5 µs 88.2 µs −47.3 µs (−35%)
warm min 122.0 81.8
warm max 148.0 96.0
cold (round 0) 336.1 308.7 −27 µs

The −47 µs is larger than the isolated handshake_all_cores delta measured on paged_attention (−20 µs) because qwen3 decode runs the full core set: the removed RT2 round-trip + the per-core reg-init ack scale with core count, and preamble is handshake-dominated.

On the earlier "150 µs" in the PR description vs the 135.5 µs baseline here: it is the same 2-layer example (#1088, added 2026-06-25, predates this PR), so this is not an example change. The difference is (a) this PR's branch is now rebased onto ~28 newer main commits, several of which touch the per-run / preamble / arena path (#1242 per-run simpler_run state lifecycle, #1193 prebuilt-arena image caching, #1216/#1227 Runtime split + HostApi refactors), and (b) run-to-run jitter — the warm baseline here spans 122–148 µs (σ ≈ 7 µs), so 150 µs sits at ~2σ, within the same band. The confound-free figure is the toggle above (135.5 → 88.2 µs on identical inputs).

@ChaoZheng109

Copy link
Copy Markdown
Collaborator

What the preamble phase actually covers (it is not only the handshake)

preamble brackets the entire per-run AICPU init() on the device, not just the handshake. In aicpu_execute() (aicpu_executor.cpp:878):

{
    AicpuPhaseScope preamble(AicpuPhase::Preamble);
    g_aicpu_executor.init(runtime);   // everything below
    while (!init_done_) { ... }        // the other AICPU threads spin here
}

init()SchedulerContext::init() runs, in order:

  1. AICPU thread-pool init barrier — the first thread runs init; the other aicpu_thread_num - 1 threads spin until init_done_.
  2. handshake_all_cores() — the AICPU↔AICore discovery handshake (what this PR changes). The AICPU blocks here until every core wakes across the NoC and reports, opens each core's register window, and (pre-PR) ran the second ack round-trip. The structural floor lives inside this wait: AICore kernel binary load on first launch + per-core NoC cold-wakeup.
  3. assign_cores_to_threads() — cluster-aligned round-robin assignment of the discovered cores to scheduler threads.
  4. Profiling-subsystem initl2_swimlane_aicpu_init / pmu_aicpu_init / dump_args_init / dep_gen_aicpu_init, only when the corresponding profiling flags are set (OFF in the numbers above).
  5. Task-counter init — read the total task count from PTO2 shared memory.
  6. Per-core payload setuppayload_per_core_ reset + SPMD sub_block_id wiring.

So preamble = thread-barrier + handshake + core-assignment + (profiling init) + task-counters + payload setup. The handshake is the dominant, blocking component — it is the part that stalls on the NoC wakeup + protocol round-trips — so this PR's handshake change moves preamble by ~the handshake delta, while steps 1/3/5/6 are fast and unchanged.

If it's useful I can re-instrument handshake_all_cores on the qwen3 workload to give the exact handshake-vs-rest split inside the 135.5 → 88.2 µs (the way the isolated handshake_all_cores was measured on paged_attention: 54.7 → 34.3 µs).

@ChaoZheng109

Copy link
Copy Markdown
Collaborator

Confirming the preamble win is exactly the handshake win

To verify that the preamble reduction is fully explained by the handshake change (and not by some other phase), I instrumented handshake_all_cores() with a timer and measured it alongside preamble on the same qwen3-14B 2-layer decode workload (a2a3 onboard, --rounds 20, warm avg n=19):

phase baseline (Sweep A/B + RT2) this change (single sweep) Δ
handshake_all_cores 88.2 µs 41.3 µs −46.9 µs
preamble (clean, un-instrumented) 135.5 µs 88.2 µs −47.3 µs

Δhandshake (46.9 µs) ≈ Δpreamble (47.3 µs) — they match to within 0.4 µs, so the entire preamble improvement comes from the handshake.

Decomposition check — the non-handshake part of preamble (thread-pool init barrier + assign_cores_to_threads + task-counter init + payload_per_core_ setup) is preamble − handshake:

  • baseline: 135.5 − 88.2 = 47.3 µs
  • this change: 88.2 − 41.3 = 46.9 µs

i.e. the non-handshake remainder is unchanged (~47 µs both sides), as expected — this PR only touches the handshake. Handshake share of preamble: 65% → 46% (it shrank; the rest held constant).

(The handshake timer is a temporary LOG_INFO_V9 used only for this measurement — not part of the PR.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants