update(distributed): convert all_to_all distributed to push-based lowering#1267
update(distributed): convert all_to_all distributed to push-based lowering#1267georgebisbas wants to merge 3 commits into
Conversation
Replace the 3-phase pull-based all_to_all kernel (stage-in → barrier → remote TLOAD) with a 2-phase push-based kernel (TPUT to peers → barrier → local copy-out). - Phase 1: TPUT each input chunk directly to the corresponding peer's scratch at offset my_rank*C via pto::comm::TPUT. Self-rank handled naturally — CommRemotePtr to self returns the local pointer. - Phase 2: unchanged notify/wait barrier. - Phase 3: purely local TLOAD from scratch → TSTORE to output (no CommRemotePtr needed here — scratch already holds the result). - Single pushTile reused for both push and copy-out phases (was two tiles: stageTile + recvTile). No post-exchange barrier needed (no WAR hazard — input read-only, scratch write-only from peers). - CPU_SIM fallback: manual element loop instead of TPUT. Scratch naming preserved for consistency with all other L3 examples (allreduce, allgather, broadcast, reduce_scatter). Sim-verified: P=2 and P=4 golden checks pass (0.000e+00 max diff).
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe all-to-all distributed example is refactored from a symmetric 3-phase stage-in/exchange pattern to a symmetric 2-phase push-based pattern. The kernel now performs direct TPUT pushes into peer scratch, a barrier via signal notify/wait, and local copy-out. Documentation and comments across README files, orchestration shim, and main.py are updated accordingly. Changes2-Phase Push Refactor
Estimated code review effort: 2 (Simple) | ~12 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the distributed all-to-all example to use a more efficient 2-phase push-based algorithm (push, barrier, and local copy-out) instead of the previous 3-phase pattern. Feedback on these changes includes fixing a markdown formatting typo in the README and optimizing the kernel loop by hoisting a loop-invariant pointer calculation outside of the loop.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/workers/l3/all_to_all_distributed/README.md (1)
39-43: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winMalformed code fence and stale kernel description.
Line 39 concatenates the fence language tag with content:
textpush, barrier, copy-out instead of ````text ```` on its own line — this breaks the code fence/language identifier. It looks likepush, barrier, copy-outwas intended to replace the stale `(stage-in, barrier, exchange)` description of the kernel on Line 43, which still describes the old 3-phase design.📝 Proposed fix
-```textpush, barrier, copy-out +```text all_to_all_distributed/ ├── kernels/ │ ├── aiv/ -│ │ └── all_to_all_kernel.cpp # AIV kernel (stage-in, barrier, exchange) +│ │ └── all_to_all_kernel.cpp # AIV kernel (push, barrier, copy-out)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/workers/l3/all_to_all_distributed/README.md` around lines 39 - 43, Fix the malformed fenced block in the all_to_all_distributed README and update the stale kernel phase description. In the README’s tree snippet, separate the code fence language tag from the content so the fence starts correctly as a standalone text block, and update the all_to_all_kernel.cpp comment to use the current “push, barrier, copy-out” wording instead of the old “stage-in, barrier, exchange” description. Use the README tree section and the all_to_all_kernel.cpp entry as the anchors when editing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@examples/workers/l3/all_to_all_distributed/README.md`:
- Around line 39-43: Fix the malformed fenced block in the
all_to_all_distributed README and update the stale kernel phase description. In
the README’s tree snippet, separate the code fence language tag from the content
so the fence starts correctly as a standalone text block, and update the
all_to_all_kernel.cpp comment to use the current “push, barrier, copy-out”
wording instead of the old “stage-in, barrier, exchange” description. Use the
README tree section and the all_to_all_kernel.cpp entry as the anchors when
editing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e8bf3e22-0c52-406a-b6ea-86964d314f96
📒 Files selected for processing (5)
examples/workers/l3/README.mdexamples/workers/l3/all_to_all_distributed/README.mdexamples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cppexamples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cppexamples/workers/l3/all_to_all_distributed/main.py
The pto-isa CPU simulator now has native TPUT support (TPut.hpp, via Copy_Data wrapper). The manual element-loop fallbacks guarded by __CPU_SIM are no longer needed — pto::comm::TPUT works on both hardware and simulator paths. - all_to_all_kernel.cpp: 1 guard removed - combine.cpp: 1 guard removed - dispatch.cpp: 3 guards removed - kernel_allreduce_sum.cpp: 1 guard removed Sim-verified: all_to_all (P=2, P=4), ffn_tp_parallel all pass. ep_dispatch_combine golden check is a pre-existing failure (fails on main too).
- Fix markdown code block language specifier typo in README.md (textpush,barrier,copy-out -> text) - Hoist loop-invariant scratch_dst_local pointer outside the Phase 1 loop (gemini-code-assist review)
Summary
Replace the 3-phase pull-based all_to_all kernel (stage-in → barrier → remote TLOAD) with a 2-phase push-based kernel (TPUT to peers → barrier → local copy-out), mirroring the push-based approach proposed in pypto#1937.
Benefits
CommRemotePtrcross-rank reads needed.pushTilereused for both push and copy-out (was two tiles:stageTile+recvTile).What changed
kernels/aiv/all_to_all_kernel.cpppto::comm::TPUTkernels/orchestration/all_to_all_orch.cppmain.pyall_to_all_distributed/README.mdl3/README.mdHow it works
The push primitive
pto::comm::TPUTis already proven inep_dispatch_combine/kernels/aiv/combine.cpp. CPU_SIM path uses a manual element loop since TPUT is not available on the simulator.Verification
max |out - expected| = 0.000e+00all ranks matched golden ✅test_all_to_all_distributed[2]+[4]PASSReferences
ep_dispatch_combine/kernels/aiv/combine.cpp: existing TPUT usage pattern