Add: fully_distributed_within_core runtime — SPMD on-core orchestration#1142
Add: fully_distributed_within_core runtime — SPMD on-core orchestration#1142hengliao1972 wants to merge 25 commits into
Conversation
Introduce a new runtime where orchestration, scheduling, and execution all
run in SPMD fashion on the AICore workers themselves, with AICPU reduced to a
thin init/handoff/teardown stub (no scheduler). Each core builds, owns, and
executes its own tasks.
Design pillars (see docs/fully_distributed_within_core.md):
- Task ownership via a claim race over two global cursors (cube_cursor for
AIC-anchored, vector_cursor for AIV-only).
- owner = builder = executor, with core-type matching.
- Full per-core duplicate TensorMap for dependency discovery.
- Per-core private task ring + one global completion-flag ring driving a
run-ahead, pull-based execution loop.
- block.won anchor/follower co-ownership for multi-core (MIX / 2V) tasks.
- Bounded GM output-heap ring reclaimed by a global completion frontier.
Includes: - docs/fully_distributed_within_core.md (authoritative design, in Chinese).
- src/{a2a3,a5}/runtime/fully_distributed_within_core/ runtime skeleton +
dist_engine (a2a3 validated on a2a3sim via benchmark_bgemm + mix_coown).
- examples/a2a3/fully_distributed_within_core/ migrated examples.
- tests/st/a2a3/fully_distributed_within_core/ migrated ST tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Important Review skippedToo many files! This PR contains 313 files, which is 163 over the limit of 150. To get a review, narrow the scope: Upgrade to a paid plan to raise the limit. ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (313)
You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the fully_distributed_within_core runtime mode, enabling orchestration, scheduling, and execution to run entirely on the AI cores in an SPMD fashion, bypassing the AICPU. It includes comprehensive documentation, core executor updates, and several new test cases and examples (such as paged_attention and qwen3_14b_decode). The reviewer provided valuable feedback focusing on robustness, security, and performance. Key recommendations include: explicitly resetting persistent global/static variables (like initialized_) in aicpu_executor.cpp to prevent state leakage across launches; gating expensive MMIO register reads (like DATA_MAIN_BASE) in tight spin loops to reduce CPU overhead; clamping task overhead metrics to a minimum of 0 to prevent negative values caused by clock skew; using RAII cleanup structs for temporary .so files to guarantee deletion on all exit paths; validating signed integer count/size parameters to prevent signed-to-unsigned conversion overflows in orchestration kernels; and avoiding angle brackets in Markdown bash code blocks to prevent input redirection parsing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| int32_t AicpuExecutor::init(Runtime *runtime) { | ||
| bool expected = false; | ||
| if (!initialized_.compare_exchange_strong(expected, true, std::memory_order_acq_rel, std::memory_order_acquire)) { | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
On platforms where static or global variables persist across launches on the same loaded binary (such as AICPU), always explicitly reset or initialize these variables at the start of each execution/launch to prevent state leakage and undefined behavior between runs. If a previous launch aborted or crashed, deinit might not be called, leaving initialized_ as true. On the next launch, init will return 0 immediately without re-initializing. Consider detecting a new launch (e.g., by comparing the runtime pointer) to safely reset the executor state.
References
- On platforms where static or global variables persist across launches on the same loaded binary (such as AICPU), always explicitly reset or initialize these variables at the start of each execution/launch to prevent state leakage and undefined behavior between runs.
| while (true) { | ||
| uint32_t reg_val = static_cast<uint32_t>(read_reg(RegId::DATA_MAIN_BASE)); | ||
| if (reg_val == AICORE_EXIT_SIGNAL) { | ||
| write_reg(RegId::COND, AICORE_EXITED_VALUE); | ||
| break; | ||
| } | ||
| SPIN_WAIT_HINT(); | ||
| } |
There was a problem hiding this comment.
Reading system registers (such as DATA_MAIN_BASE) via MMIO reads inside a tight spin loop can be extremely expensive and cause high CPU overhead and latency impact. Consider gating this check to run periodically (e.g., every 1024 spins) to minimize overhead.
uint32_t spin_count = 0;
while (true) {
if (spin_count++ % 1024 == 0) {
uint32_t reg_val = static_cast<uint32_t>(read_reg(RegId::DATA_MAIN_BASE));
if (reg_val == AICORE_EXIT_SIGNAL) {
write_reg(RegId::COND, AICORE_EXITED_VALUE);
break;
}
}
SPIN_WAIT_HINT();
}References
- Avoid reading system counters (which can be expensive MMIO reads) or performing complex structural checks on every iteration of a tight spin loop. Instead, gate these checks to run periodically (e.g., every 1024 spins) to minimize CPU overhead and latency impact.
| LOG_INFO_V9( | ||
| "Thread %d: task+heap_alloc: %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", | ||
| thread_idx, cycles_to_us(p.alloc_cycle), p.alloc_cycle * 100.0 / total, | ||
| cycles_to_us(p.alloc_cycle - p.alloc_wait_cycle), cycles_to_us(p.alloc_wait_cycle), | ||
| static_cast<uint64_t>(p.alloc_atomic_count) | ||
| ); |
There was a problem hiding this comment.
When calculating task overhead metrics, clamp the values to a minimum of 0 to prevent negative overhead accumulation. This avoids distortion of metrics caused by dual-issue concurrent execution or clock synchronization skew between different processors.
LOG_INFO_V9(
"Thread %d: task+heap_alloc: %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "",
thread_idx, cycles_to_us(p.alloc_cycle), p.alloc_cycle * 100.0 / total,
cycles_to_us(p.alloc_cycle > p.alloc_wait_cycle ? p.alloc_cycle - p.alloc_wait_cycle : 0), cycles_to_us(p.alloc_wait_cycle),
static_cast<uint64_t>(p.alloc_atomic_count)
);| ); | ||
| LOG_INFO_V9( | ||
| "Thread %d: fanin+ready : %.3fus (%.1f%%) work=%.3fus wait=%.3fus", thread_idx, | ||
| cycles_to_us(p.fanin_cycle), p.fanin_cycle * 100.0 / total, | ||
| cycles_to_us(p.fanin_cycle - p.fanin_wait_cycle), cycles_to_us(p.fanin_wait_cycle) |
There was a problem hiding this comment.
When calculating task overhead metrics, clamp the values to a minimum of 0 to prevent negative overhead accumulation. This avoids distortion of metrics caused by dual-issue concurrent execution or clock synchronization skew between different processors.
LOG_INFO_V9(
"Thread %d: fanin+ready : %.3fus (%.1f%%) work=%.3fus wait=%.3fus", thread_idx,
cycles_to_us(p.fanin_cycle), p.fanin_cycle * 100.0 / total,
cycles_to_us(p.fanin_cycle > p.fanin_wait_cycle ? p.fanin_cycle - p.fanin_wait_cycle : 0), cycles_to_us(p.fanin_wait_cycle)
);| ); | ||
| unlink(so_path); | ||
| continue; | ||
| } | ||
| file_created = true; | ||
| LOG_INFO_V0("Thread %d: Created SO file at %s (%zu bytes)", thread_idx, so_path, so_size); | ||
| } | ||
|
|
||
| if (!file_created) { | ||
| LOG_ERROR("Thread %d: Failed to create SO file in any candidate path", thread_idx); | ||
| // Unblock scheduler threads before returning so they don't spin forever. | ||
| runtime_init_ready_.store(true, std::memory_order_release); | ||
| return -1; | ||
| } | ||
|
|
||
| dlerror(); | ||
| void *handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL); | ||
| const char *dlopen_err = dlerror(); | ||
| if (handle == nullptr) { | ||
| LOG_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlopen_err ? dlopen_err : "unknown"); | ||
| unlink(so_path); | ||
| // Unblock scheduler threads before returning so they don't spin forever. | ||
| runtime_init_ready_.store(true, std::memory_order_release); | ||
| return -1; | ||
| } | ||
| LOG_INFO_V0("Thread %d: dlopen succeeded, handle=%p", thread_idx, handle); |
There was a problem hiding this comment.
Using an RAII cleanup struct is much safer and cleaner than manually unlinking the temporary .so file on every error path. This guarantees deletion on all exit paths, including any unexpected early returns or future modifications.
if (!file_created) {
LOG_ERROR("Thread %d: Failed to create SO file in any candidate path", thread_idx);
// Unblock scheduler threads before returning so they don't spin forever.
runtime_init_ready_.store(true, std::memory_order_release);
return -1;
}
struct SoCleanup {
const char* path;
SoCleanup(const char* p) : path(p) {}
~SoCleanup() { if (path && path[0] != '\0') { unlink(path); } }
void cancel() { path = nullptr; }
} cleanup(so_path);
dlerror();
void *handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL);
const char *dlopen_err = dlerror();
if (handle == nullptr) {
LOG_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlopen_err ? dlopen_err : "unknown");
// Unblock scheduler threads before returning so they don't spin forever.
runtime_init_ready_.store(true, std::memory_order_release);
return -1;
}
LOG_INFO_V0("Thread %d: dlopen succeeded, handle=%p", thread_idx, handle);
// Unlink the on-disk SO immediately: dlopen has already mmap'd
// the image, so the kernel keeps the inode alive until the
// matching dlclose / process exit. This prevents stale
// libdevice_orch_<pid>_<cid>.so files from accumulating in
// /tmp when child processes exit via os._exit(0), which skips
// ~AicpuExecutor (worker.py: _sub/_chip/_child loops).
cleanup.cancel();
unlink(so_path);References
- Use secure temporary file creation methods like
mkstempinstead of predictable paths, and manage the file's lifetime using an RAII cleanup struct to guarantee deletion on all exit paths.
| # Extract orchestrator profiling (Thread 3) | ||
| grep "Thread 3:" <logfile> | ||
|
|
||
| # Extract scheduler profiling (Threads 0/1/2) | ||
| grep -E "Thread [012]:" <logfile> |
There was a problem hiding this comment.
In Markdown bash code blocks, avoid using angle brackets for placeholders (e.g., <logfile>) to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/logfile" instead.
| # Extract orchestrator profiling (Thread 3) | |
| grep "Thread 3:" <logfile> | |
| # Extract scheduler profiling (Threads 0/1/2) | |
| grep -E "Thread [012]:" <logfile> | |
| # Extract orchestrator profiling (Thread 3) | |
| grep "Thread 3:" "path/to/logfile" | |
| # Extract scheduler profiling (Threads 0/1/2) | |
| grep -E "Thread [012]:" "path/to/logfile" |
References
- In Markdown bash code blocks, avoid using angle brackets for placeholders (e.g.,
<path>) to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.
| uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)}; | ||
| uint64_t cur_seq = static_cast<uint64_t>(get_tensor_data<int32_t>(context_lens, 1, cl_idx)); | ||
| uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size; |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
int32_t seq_len = get_tensor_data<int32_t>(context_lens, 1, cl_idx);
always_assert(seq_len >= 0);
uint64_t cur_seq = static_cast<uint64_t>(seq_len);
uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
| PTO2_SCOPE(PTO2ScopeMode::MANUAL) { | ||
| CYCLE_COUNT_LAP(prof_scope); | ||
| uint64_t cur_offset = b_idx * q_head_num + q_idx * q_tile; |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
int32_t seq_len = get_tensor_data<int32_t>(context_lens, 1, cl_idx);
always_assert(seq_len >= 0);
uint64_t cur_seq = static_cast<uint64_t>(seq_len);
uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
| void *query_ptr = orch_args.tensor(0).ref().data_as<void>(); | ||
| void *kc_ptr = orch_args.tensor(1).ref().data_as<void>(); | ||
| void *vc_ptr = orch_args.tensor(2).ref().data_as<void>(); |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
int32_t seq_len = get_tensor_data<int32_t>(context_lens, 1, cl_idx);
always_assert(seq_len >= 0);
uint64_t cur_seq = static_cast<uint64_t>(seq_len);
uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
| size_t idx_ctx_len = b; | ||
| int32_t ctx_len = static_cast<int32_t *>(orch_args.tensor(7).ref().data_as<void>())[idx_ctx_len]; | ||
| int64_t pos = (static_cast<int64_t>(ctx_len) - 1); | ||
| int64_t ctx_blocks = ((static_cast<int64_t>(ctx_len) + 255) / 256); |
There was a problem hiding this comment.
When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent signed-to-unsigned conversion overflows.
size_t idx_ctx_len = b;
int32_t ctx_len = static_cast<int32_t *>(orch_args.tensor(7).ref().data_as<void>())[idx_ctx_len];
always_assert(ctx_len > 0);
int64_t pos = (static_cast<int64_t>(ctx_len) - 1);
int64_t ctx_blocks = ((static_cast<int64_t>(ctx_len) + 255) / 256);References
- When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations (such as memcpy or array indexing), always validate that the count is non-negative (e.g., count >= 0) to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.
…ibuted_within_core Engine (dist_engine): rework the core loop to be execute-first — drain ready (sub)tasks and block.won deposits before claiming at most one new task per iteration, and shrink the private ring (kPrivateSlots 8->4) so a single fast core can no longer greedily claim a long run of consecutive long tasks and starve its peers. The out-of-order window now comes from the core-count dimension, not a deep per-core ring (docs §6, §6.1). Swimlane: env-gated (PTO_DIST_SWIMLANE) Chrome-trace exporter built into the engine (the centralized L2 collector is incompatible with the AICPU stub). Each executed (sub)task is one duration event laid out by physical block (pid) and lane AIC/AIV0/AIV1 (tid). aicpu_executor dumps it once all workers finish. New simpler_setup/tools/dist_swimlane_render.py renders a static Gantt PNG; opt-in FullCore24 bgemm case (block_dim=24) fills all 24 AIC + 48 AIV lanes for full-core visualization. Function names: leaf CoreCallable carries no name, so scene_test injects the incore func_id->name map from the CALLABLE spec into the captured swimlane JSON post-run, so Perfetto and the renderer show GEMM/ADD instead of f0/f1. Co-authored-by: Cursor <cursoragent@cursor.com>
Add the captured per-core swimlane PNG (and its source Chrome-trace JSON) for benchmark_bgemm FullCore24 (24 AIC + 48 AIV, 240 GEMM + 240 ADD) and reference it from §6.1, illustrating how the execute-first claim race spreads GEMM evenly across all 24 AIC lanes (load balance). Includes the PTO_DIST_SWIMLANE capture + dist_swimlane_render reproduction steps. Co-authored-by: Cursor <cursoragent@cursor.com>
… cost Engine: add PTO_DIST_SKIP_EXEC gate. When set, execute_slot skips the incore kernel call (every (sub)task is treated as 0-cost and completes instantly) but keeps all ownership/completion/frontier bookkeeping, so the distributed loop terminates identically. This isolates the pure cost of on-core SPMD orchestration + claim race + scheduling from kernel work. New example examples/a2a3/fully_distributed_within_core/runtime_overhead_test: reuses the benchmark_bgemm workload (orchestration + GEMM/ADD incores, referenced not duplicated) and sweeps block_dim (1/2/12/24 blocks = 3..72 cores on a2a3sim), reporting the on-device orchestrator wall per config so the overhead-vs-core-count scaling is visible. Skip-exec runs need golden off; the class is also a valid SceneTestCase (manual cases) for the normal kernels-on path. Finding: device orchestration wall grows ~linearly with core count because every core replays the full orchestration (SPMD) and contends on the shared cursors — the overhead the run-ahead execution is meant to amortize once real kernels run. Co-authored-by: Cursor <cursoragent@cursor.com>
…overhead Document the PTO_DIST_SKIP_EXEC overhead experiment and the runtime_overhead_test benchmark: sweeping block_dim 1/2/12/24 (3..72 cores) on a2a3sim shows the on-device orchestration wall grows ~linearly with core count (≈11× at 24 blocks) because every core replays the full orchestration (SPMD) and contends on the shared cursors — the fixed cost that real-kernel parallelism amortizes. Co-authored-by: Cursor <cursoragent@cursor.com>
- sim aicore kernel (a2a3 + a5): release pthread TLS keys on .so unload via a destructor and abort on pthread_key_create failure, preventing key-pool exhaustion (1024 cap) that caused a SIGSEGV at high block_dim after many dlopen/dlclose cycles in device_runner. - runtime_overhead_test: add --bind CPU-affinity control (none|node:<list>| cpu:<list>); table header now reports the bound physical-core count so pinned-vs-unpinned and over/under-subscription effects are self-describing. - docs/fully_distributed_within_core.md: add §6.3 on CPU pinning, the leak fix, and the best exclusive-binding sweep (120 cores) including us/task.
…tead of O(N^2) The per-core duplicate producer map used a flat array with linear-scan lookup/insert and never retired entries. For single-buffer/many-tile workloads (bgemm) `count` grew with the whole run, making submit O(N^2): isolated 1-block sweeps showed device wall scaling super-linearly (tasks x8 -> device x11, us/task 3.23 -> 4.53). Rewrite DistTensorMap to mirror tensormap_and_ringbuffer's proven PTO2TensorMap: hash buckets by buffer base + doubly-linked bucket chains + per-task entry chains + free list + lazy invalidation + per-task cleanup_retired. Retirement uses the deterministic N-H threshold (H span contract, same bound that recycles the GM heap region), so chains stay bounded by the live H-window and every core's map evolves identically. insert always links a fresh entry (no in-place replace) so cleanup can free a task's entries via its chain; lookup returns the max overlapping producer, equivalent to the old replace-in-place semantics. Also defer built[] assembly until after the claim so the ~2/3 of cores that fail type_match / lose the race skip the tensor copies. Result (1-block isolation, skip-exec): O(N^2) -> O(N); 3840 tasks 34.76 -> 4.01 ms (8.7x), us/task 4.53 -> 0.52; neutral-to-better at 480 tasks. Golden validated on bgemm / paged_attention / paged_attention_ringbuffer / mix_coown. Analysis documented in docs §6.4. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Updated with The per-core duplicate producer map was a flat array with linear-scan lookup/insert that never retired entries; for single-buffer/many-tile workloads (bgemm) Rewrote Result (1-block, skip-exec): 3840 tasks 34.76 → 4.01 ms (8.7×), us/task 4.53 → 0.52; neutral-to-better at 480 tasks. Golden validated on bgemm / paged_attention / paged_attention_ringbuffer / mix_coown. Analysis added in docs §6.4. |
…inning + drain fast paths - device_runner: pin each AICore sim thread 1:1 into one NUMA node via PTO_SIM_AICORE_NUMA_NODE so the AICore working set stays node-local (avoids cross-NUMA noise on this 8-node host); PTO_SIM_AICORE_PIN_VERBOSE logs per-thread placement. - runtime_overhead_test: add --aicore-numa to drive the above; default batch 480->1000 (~2000 tasks) for better fixed-cost amortization; docs. - dist_engine: low-risk orchestration micro-opts (A.1) — empty-ring fast path in drain_phase_b, and a monotone per-block any_pub flag so follower drain_block_won/has_pending_won short-circuit the per-slot won scan for single-core workloads (bgemm), behaviour-identical otherwise. - docs: single-NUMA evaluation methodology, thread-pinning results, baseline/scaling overhead analysis.
…rm-aware overhead --blocks The AICore single-NUMA thread pinning added in f896159 uses cpu_set_t / sched_setaffinity / sched_getcpu, which are Linux-only and broke the macOS host build (unknown type 'cpu_set_t', undeclared 'sched_getcpu'). Wrap the per-thread affinity call in `#if defined(__linux__)`; on other hosts (macOS) AICore pinning is a no-op. Linux behavior is unchanged. Make runtime_overhead_test --blocks default platform-aware: macOS hosts have few physical cores so default to '1-2'; Linux dev boxes default to '1-13'. Explicit --blocks still overrides. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…on losers) Hoist the anchor-type + cursor claim ahead of the producer-map operations so the ~2/3 of cores that fail type_match / lose the race skip the fan-in lookup entirely. They still perform the unconditional output insert, so every core's duplicate TensorMap stays identical (§4); the winner computes fan-in before its own insert (preserving INOUT lookup-before-insert). This is the work that load-balancing across cores can actually amortize: more blocks -> each core wins fewer tasks -> skips more fan-in lookups. Measured dev-vs-1blk (tasks=4000) flattens from 1.7x/2.2x (2/4 blocks) to ~0.7-1.1x. The per-core "floor" (output materialization + output insert, inherent to the per-core duplicate map) is unchanged, so 1-block is flat. Golden validated: bgemm / paged_attention / paged_attention_ringbuffer / mix_coown. Docs §6.4 updated. Co-authored-by: Cursor <cursoragent@cursor.com>
… % G (G=4) Split each per-anchor-type claim cursor (cube_cursor / vector_cursor) into kCursorShards (=4) independent, cache-line-padded sub-cursors. A task id N claims on shard (N % kCursorShards); the shard is a pure function of N (identical on every core, no worker partitioning), so claim semantics stay byte-for-byte equivalent to a single cursor — exactly one owner per task, every core eligible, same progress/skew, same determinism. Sharding ONLY spreads the CAS traffic across G cache lines to cut the false-sharing / coherence contention that dominated us/task at higher core counts (§6.5). Each sub-cursor is alignas(64)-padded so adjacent shards never share a line; all entries init to -1. Golden checks pass (bgemm 24/3-block, mix_coown, paged_attention). Docs: rewrite §6.6 to prove the single-cursor equivalence (vs. the inferior worker-partitioning variant), add §6.5 core-scaling overhead analysis, and correct the H definition to be SCOPE-derived (PC exiting a scope ends the visibility of its tensors) rather than a hardcoded constant; the a2a3 prototype's kHDefault is documented as a conservative stand-in. Co-authored-by: Cursor <cursoragent@cursor.com>
…g results Record the landed sharding (G=4) impact on single-NUMA orchestration overhead: - single cursor -> G=4: ~6-12% lower us/task at mid/high blocks (8-13), curve stays clean and monotonic. - G=4 vs G=8: G=8 is worse (8-13% at mid/high blocks) in the single-NUMA, <=39-core regime -> G=4 is the sweet spot; larger G only pays off cross-NUMA. Also update the §6.5 closing note (sharding + winner-only fan-in now implemented).
Pure clang-format pass with no logic change, split out so the following feature/fix commits carry only semantic diffs (easier review).
fully_distributed_within_core can busy-wait each incore's reference duration (example_exec_time_ns, from the CALLABLE incores) in place of running the real kernel, so a fast a2a3sim run reflects measured on-hardware durations. The switch forces skip-golden since outputs are not computed. - CallConfig carries use_example_exec_time + example_exec_time_ns[64]; the wire layout is kept in sync in worker.py (_CFG_FMT) and remote_wire. - run_prepared plumbs the two fields; a weak runtime_apply_example_exec_time hook lets only fully_distributed implement it (strong override in its runtime_maker), every other runtime links the weak no-op. - scene_test maps CALLABLE example_exec_time_ns -> config and rejects the switch for any runtime other than fully_distributed_within_core; conftest adds --use-example-exec-time. - dist_engine execute_slot busy-waits example_exec_time_ns_[func_id] ns when set, else runs the real kernel. - Add paged_attention_unroll example (a2a3sim) with CaseSimSmall for fast feature validation and example_exec_time_ns on the incores.
…-tagged swimlane Fix a large-scale a2a3sim crash in dist_alloc_tensors and enrich the distributed swimlane so per-stage cost (not just kernel time) is visible. Crash root cause: the heap reclaim back-pressure computed the live window as the UNSIGNED subtraction `heap_next - vend[F-H]`. dist_alloc_tensors was replayed by every core unconditionally, so a core lagging the global frontier by more than H tasks has heap_next < vend[F-H]; the subtraction wrapped to ~2^64, spuriously tripped the "heap ring too small" fatal, and the empty TaskOutputTensors then failed a downstream get_ref assert (SIGABRT). Only the busy-wait replay path (--use-example-exec-time) hit it, at batch>=16 — real kernels keep the frontier close enough that no core lags past H. Fix: give dist_alloc_tensors the same single-owner election dist_submit_impl uses. Materialize outputs + producer-map registration stay per-core (deterministic), then a claim on a new alloc_cursor elects one owner; only that owner runs the reclaim back-pressure and publishes completion. The winner is by construction at/ahead of the frontier (its task is not yet done, so F < N), so the window subtraction can no longer underflow — no arithmetic guard needed. Materializing before the (now winner-only) back-pressure also means a genuine fatal returns a populated result instead of the empty one that asserted. Swimlane: TraceEvent gains a phase (kernel / alloc / build / ringbp / drain_won / replay) so the exported Chrome trace shows the work between kernels — claim/ build, alloc, the ring/heap back-pressure spin (ringbp), and the per-core replay of lost claims — not just kernel spans. The two winner back-pressure spins are timed as a separate ringbp phase so dependency/slot WAITING is no longer misattributed to build (build itself is sub-microsecond; the spin can be hundreds of us under a small ring / few blocks). Capture stays zero-cost when PTO_DIST_SWIMLANE is unset; the trace vector is reserved up front when on so push_back never reallocs mid-run. scene_test's func-id->name pass prefixes non-kernel phases so a task's build span no longer collides with its kernel span on the same lane.
fix(fully_distributed_within_core): alloc 单owner选举修复 heap 回收下溢 + phase化泳道
…ad CPU time) Each swimlane span now also records the thread CPU time it accrued (CLOCK_THREAD_CPUTIME_ID) alongside wall-clock duration. On an oversubscribed sim host a span's wall time inflates while the worker thread is descheduled, making orchestration overhead (alloc / replay / build) look far larger than it is; the CPU-time figure does not inflate, so comparing the two separates genuine work from host-stealing. - TraceEvent carries cpu_us; trace_overhead samples CLOCK_THREAD_CPUTIME_ID at both ends (only when PTO_DIST_SWIMLANE is set — zero cost otherwise). Kernel spans set cpu_us = dur (busy-wait/real-run both burn CPU). - The dumped Chrome trace adds a shadow lane (tid + 10, "<lane>·cpu") that draws the same spans with width = cpu_us. Reading the wall row against the cpu row: long wall + short cpu == descheduled (not work); equal rows == real CPU time (e.g. ringbp's sched_yield spin). cpu_us is also exposed in args. Diagnostic only — no scheduling/execution behavior change.
…se flow edges
Make the distributed swimlane explain WHY a ringbp stalls, hop by hop, instead
of only showing kernel bars. Diagnostic-only (active when PTO_DIST_SWIMLANE is
set); no scheduling/execution behavior change.
- Static dependency edges (cat="dep"): one Chrome-trace flow per resolved
fan-in, recorded at build time as {consumer, producer}. Drawn producer-kernel
-> consumer-kernel, so following arrows backward walks the data-dependency
chain ("what does this task wait on, and what does THAT wait on").
- Slot-release edges (cat="slot"): the real reason a ringbp blocks. On entering
the ring back-pressure the owner snapshots the tasks already occupying its
private ring (the ring only shrinks during the wait, so the snapshot is the
complete set); those tasks must execute to free a slot. Drawn occupant-kernel
-> ringbp. Chains with the dep edges: ringbp --slot--> occupant kernel
--dep--> the occupant's fan-in kernels.
- Wall/CPU split into separate process groups (pid=block vs pid=block+1000),
process_sort_index orders all wall groups above all cpu groups; a long wall
bar with a short cpu bar means descheduled (host oversubscription), equal bars
mean real CPU time. Flow arrows live only in the cpu group.
- In the cpu group, kernel spans go on their own sub-lane (tid=lane+3): a ringbp
span time-contains the kernel that ends its wait, so on one lane perfetto
would nest+hide the kernel; the split keeps both visible.
- CPU-group kernel spans also carry func_id so the framework's func-id->name
pass labels them QK/SF/PV/UP (matching the wall group) instead of fN#id.
DepEdge/slot_edges live on DistCore (sim-only file); recording is gated on
g_trace_on so a normal run is unaffected.
…thread start in sim In sim each AICore "core" is a host pthread the OS schedules in one at a time (hundreds of µs apart on a busy box), so early-claimed tasks begin executing while later cores have not yet been scheduled. The swimlane then shows a long first-task cold-start stagger that is host-scheduling noise, not engine behavior — inflating ringbp waits with a sim-only artifact absent on real hardware (where all cores are powered up and spinning before the first dispatch). Add a startup barrier in dist_core_main: every worker bumps started_count on entry and bare-spins until it reaches num_workers before beginning replay. This makes the trace reflect the "all cores already started and exclusive" hardware semantics — steady-state contention only. Measured (Case1, a2a3sim, --use-example-exec-time): first-kernel stagger 6154→478us, longest ringbp 3869→666us, execution-window wall 11307→6612us.
…behind compile-time macros
Two related swimlane changes, kept together since they touch the same
tracing path:
1. Full-coverage lap tracing
- Add EfDrain (execute-first drain at entry) and Commit (slot alloc +
build_ring_slot publish tail) phases — these were previously outside
every span, so their time showed as unexplained blanks on a lane.
- Overhead spans use a lap cursor (trace_lap): each span is
[last_mark, now), then the cursor advances, so consecutive spans abut
with zero gap (same idiom as pto_orchestrator's CYCLE_COUNT_LAP).
- Reset the lap cursor at each submit entry so the inter-submit orch
round-trip (USER orchestration code) is left un-timed and never
biases EfDrain.
- Store raw ns in TraceEvent; move ns->us conversion to the dump stage
so the hot path does no floating-point division.
- Kernel spans keep start/end recording and their nominal sim_ns
duration, shown independently on the kernel sub-lane.
2. Compile-time gating for the AICore/CCEC target
The swimlane path uses host-only facilities (std::vector, std::chrono,
clock_gettime, fprintf) unavailable under CCEC, so wrap it behind gates
that vanish cleanly when off:
- DIST_TRACE_ENABLED = (PTO2_PROFILING + 0): tracing master switch. Sim
builds define PTO2_PROFILING=1 (via profiling_config.h) so tracing is
on; a CCEC build that omits the macro gets #if 0 and every span
capture, the TraceEvent/TracePhase types, the per-core std::vector
members, and the JSON dump are compiled out.
- DIST_SIM_HOST_CLOCK: gates the sim-only host clock (now_ns) and the
use_example_exec_time busy-wait + g_skip_exec. Separate from tracing
since the busy-wait is a sim feature, not a probe. A #error guards the
illegal trace-on-but-host-clock-off combination.
- Call sites use TRACE_LAP / TRACE_LAP_RESET / TRACE_OVERHEAD macros
that forward to _impl inlines when on and expand to a no-op when off,
so hot paths need no #if and the phase enum need not exist when off.
- dist_engine_dump_trace() keeps an empty-body definition when off so
the aicpu_executor call site still links.
Verified: sim (macro on) rebuilds and the paged-attention swimlane is
identical (same span counts/phases, submit-internal gap == 0); macro off
both -fsyntax-only and a full sim rebuild+run pass (tracing gone,
busy-wait intact, dump a safe no-op).
perf(fully_distributed_within_core): 泳道依赖边/槽位释放边 + wall-cpu 双区诊断
…p with private + shared modes Convert the distributed TensorMap to a single ring-per-bucket structure for both replication modes, selectable via PTO_DIST_TENSORMAP_MODE: - private: per-core replicated ring; plain-integer head/tail (no atomics), N-H deterministic reclaim. - shared: one global ring. Appends are serialized in task-id order by a tm_insert_next sequencer (correct-by-construction: identical content to a private replica at every fan-in resolution). Reads use a single acquire on the monotonic tail watermark, then relaxed per-slot seq reads (ABA guard) -- amortizing the memory-order cost to one acquire per lookup. Dual lookup filters producer in [N-H, N) mirror private's visible set. Global reclaim floor = min_progress - H - 1. - auto ring capacity: compile-time exact sizing (H for private, Delta+H for shared); deterministic FATAL on overflow. - run-ahead back-pressure (PTO_DIST_RUNAHEAD): bounds Delta so the shared Delta+H window cannot overflow; cooperative drain (no thread parking, no deadlock), CLI --runahead. - overhead instrumentation (PTO_DIST_OVERHEAD): deterministic TMOPS op counts + makespan/busy/replay wall; PTO_DIST_DEPSIG dep-graph signature oracle. Verified: private == shared bit-identical dep-graph signature at 6 and 24 cores (runtime_overhead) and 72 cores (BGEMM, private side). Differential ring-vs-reference UT added. Design + quantitative overhead analysis in docs/fully_distributed_within_core.md (Sections 12.4-12.10). Co-authored-by: Cursor <cursoragent@cursor.com>
…t_engine + onboard landing design
Port the decentralized SPMD engine to the a5sim platform and share the
engine source between a2a3 and a5 builds:
- Move dist_engine.{cpp,h} to src/common/runtime/fully_distributed_within_core
and wire both a2a3/a5 build_config to the shared source (no duplication).
- Portable dist_set_local_block SFINAE helper to bridge LocalContext
block_idx (a2a3) vs s_block_idx (a5) without touching intrinsic.h.
- Align a5 runtime types (DistHandoff.global_data_base, PTO2Runtime.dist_global,
Runtime example-exec-time fields) with the a2a3 engine contract.
- a5 aicpu_executor: distributed-engine handoff (register + go/done_count).
- a5 aicore_executor: replace centralized register-polling dispatch with the
SPMD entry (dist_core_main) on each AICore worker.
- Fix a5 scheduler_cold_path: drop the task_count > 0 guard so the 0-task
SPMD path completes instead of hanging.
- Add a5 ST cases (vector_example, mix_coown); both PASS on a5sim with
DEPSIG private==shared parity.
Docs: §15 a5sim bring-up + scheduler fix; §16 A5 onboard landing design
(engine-placement restructuring + cross-core atomicity options + staged
plan); §17 Open Challenges tracking list (C1-C9).
Note: onboard (real A5, CCEC) is NOT yet implemented — the HW seam branch is
still a no-op/TODO and cannot be compiled/verified without the CCEC toolchain.
Co-authored-by: Cursor <cursoragent@cursor.com>
…es) + a5 bgemm swimlane test Add gd->runahead_max: a run-ahead bound Δ_max applied at every submit/alloc entry in BOTH private and shared modes. Before advancing to task N a core publishes core_progress[core] and waits until N - min(core_progress) <= Δ_max, cooperatively draining ready owned work meanwhile. This caps how far a fast core races the monotonic fetch_max claim cursor ahead, so laggards keep claiming their share (load balance under unequal core progress). Deadlock-free: the slowest core is never throttled, advances, and raises min_progress to release the rest. - Default Δ_max = 2*num_workers, auto-scaling per platform (144 on a2/a3, 216 on a5); must be >= num_workers or the window starves cores. PTO_DIST_RUNAHEAD=N overrides (0 disables); in shared mode it also bounds the ring append frontier. - core_progress[] is now published + zeroed in both modes (was shared-only). - Deterministic replay / dependency graph unaffected: PTO_DIST_DEPSIG is bit-identical across modes and across Δ_max values (verified a5sim + a2a3sim). - docs §12.7.3 (mechanism + per-platform defaults + oversubscription caveat), env-var table, and §13 segment map updated; --runahead help text clarified. - Add a5 benchmark_bgemm ST test (aic/aiv kernels + orchestration + test) used for swimlane / load-balance analysis on a5sim. Co-authored-by: Cursor <cursoragent@cursor.com>
概述
新增 runtime
fully_distributed_within_core:编排(orchestration)、调度(scheduling)与执行(execution)全部以 SPMD 方式运行在 AICore 自身之上,AICPU 完全不参与关键路径(仅保留 init / handoff / teardown 桩)。不存在独立调度器——每个核自行构建、拥有并执行自己的任务。权威设计见
docs/fully_distributed_within_core.md。设计支柱
cube_cursor用于 AIC-anchored,vector_cursor用于 AIV-only)扫过同一确定性 submit id 空间。本次更新
1)负载均衡策略改进——避免极度不均衡
旧的核循环是"先把私有环填满、再开始执行"。由于抢占(claim)+ 填环极快、而任务执行很慢,一个快核会把连续很多个任务一口气抢光直到填满私有环,再串行地慢慢执行;其它核因 cursor 已被推到很远而无任务可抢,造成严重负载不均衡与按序的 Head-of-Line Blocking。乱序窗口本应来自 核数 × 私有环 这个维度,但深私有环把窗口压在了单核内部。
改为 执行优先、认领其次、一次只认领一个(execute-first run-ahead):
kPrivateSlots8 → 4 缩小,把单核 run-ahead 上限收紧;乱序能力交还给核数维度。详见
docs/fully_distributed_within_core.md§6 / §6.1。2)新增每核执行泳道图(swimlane)
由于本 runtime 没有中心化 AICPU 调度器,原 L2 swimlane 采集器不适用。新增引擎内置、环境变量门控的 Chrome-trace 导出:
PTO_DIST_SWIMLANE=<path>即开启(未设为 no-op);每个被执行的(子)任务记为一个 duration 事件,按**物理 block(pid)**与 **lane AIC/AIV0/AIV1(tid)**排布,直观展示 execute-first 竞争如何把工作摊到各核(负载均衡)。aicpu_executor在所有 worker 完成后一次性导出 JSON(可直接拖入 https://ui.perfetto.dev/ )。simpler_setup/tools/dist_swimlane_render.py,把 JSON 渲染为静态 Gantt PNG。benchmark_bgemm::FullCore24(block_dim=24),填满 24 AIC + 48 AIV 共 72 条 lane,便于满核可视化。CoreCallable不携带名字,故由scene_test在捕获后把func_id → name(来自 CALLABLE spec)注入泳道 JSON,使 Perfetto 与渲染图显示GEMM/ADD而非f0/f1。3)编排/调度开销实测(runtime_overhead_test)
为测量全分布式模式"无中心调度器"的代价,引擎新增门控
PTO_DIST_SKIP_EXEC=1:置位后execute_slot跳过 incore kernel 调用(每个子任务当 0 代价瞬时完成),但保留全部 ownership/完成/frontier 簿记,循环照常终止。这样测得的片上墙钟只反映 orchestration + claim race + scheduling。新增基准
examples/a2a3/fully_distributed_within_core/runtime_overhead_test/:复用 bgemm 工作负载,扫block_dim(1 block = 1 AIC + 2 AIV),报告每档片上编排墙钟。a2a3sim(约 960 任务,中位数):结论:纯编排/调度墙钟随核数近线性增长(3→72 核约 11×),因为每个核都完整重放整段编排(SPMD)且在共享 cursor 上竞争;少核时增量很小(2 块仅 +20%),核越多越陡。这部分固定开销要靠真实 kernel 执行被多核并行摊薄来回本(本实验故意跳过执行,只暴露开销)。详见设计文档 §6.2。
包含内容
docs/fully_distributed_within_core.md— 权威设计文档(中文,含 §6.1 负载均衡论证、§6.2 开销实测)。src/{a2a3,a5}/runtime/fully_distributed_within_core/— runtime 骨架与dist_engine(execute-first 循环 + 泳道导出 + skip-exec 门控)。examples/a2a3/fully_distributed_within_core/— 迁移的示例(含FullCore24满核用例、runtime_overhead_test开销基准)。tests/st/a2a3/fully_distributed_within_core/— 迁移的 ST 测试。simpler_setup/tools/dist_swimlane_render.py— 泳道图渲染脚本。验证(a2a3sim)
benchmark_bgemm(Case0 / Bgemm64 / FullCore24)、vector_example、mix_coown(MIX 共同所有权)均 PASS。FullCore24满核运行得到 72 lane / 480 事件的泳道图,图例正确显示 GEMM/ADD。runtime_overhead_test扫 1/2/12/24 blocks 全部 PASS,得到上表的编排开销 scaling。备注 / 后续
Made with Cursor