update(distributed): convert all_to_all distributed to push-based lowering by georgebisbas · Pull Request #1267 · hw-native-sys/simpler

georgebisbas · 2026-07-03T11:55:56Z

Summary

Replace the 3-phase pull-based all_to_all kernel (stage-in → barrier → remote TLOAD) with a 2-phase push-based kernel (TPUT to peers → barrier → local copy-out), mirroring the push-based approach proposed in pypto#1937.

Benefits

Fewer phases: 2 kernel phases instead of 3 (drops the separate stage-in loop).
No remote reads: Phase 3 is purely local TLOAD/TSTORE — no CommRemotePtr cross-rank reads needed.
Fewer tiles: Single pushTile reused for both push and copy-out (was two tiles: stageTile + recvTile).
No post-exchange barrier: Push model has no WAR hazard — input is read-only, scratch is write-only from peers.
Simpler code: The kernel is 26 lines shorter and the data flow is more linear.

What changed

File	Change
`kernels/aiv/all_to_all_kernel.cpp`	Rewrite: 3-phase pull → 2-phase push using `pto::comm::TPUT`
`kernels/orchestration/all_to_all_orch.cpp`	Updated comments
`main.py`	Updated docstring and comments
`all_to_all_distributed/README.md`	Updated algorithm description
`l3/README.md`	Updated row description

How it works

Phase 1 (push):  for dest in 0..NR-1:
                   TPUT input[dest*C..] → peer dest's scratch[my_rank*C..]
                 Self-rank: CommRemotePtr to self → local pointer → TPUT is local write
Phase 2 (barrier): TNOTIFY all, TWAIT all (unchanged)
Phase 3 (copy-out): for src in 0..NR-1:
                      TLOAD(local scratch[src*C..]) → TSTORE(output[src*C..])
                    Purely local — scratch already holds the result

The push primitive pto::comm::TPUT is already proven in ep_dispatch_combine/kernels/aiv/combine.cpp. CPU_SIM path uses a manual element loop since TPUT is not available on the simulator.

Verification

Pre-commit: all 13 hooks pass (check-headers, cpplint, clang-format, ruff, pyright, markdownlint, etc.)
Sim P=2: max |out - expected| = 0.000e+00
Sim P=4: all ranks matched golden ✅
pytest: test_all_to_all_distributed[2] + [4] PASS
Hardware P=2, P=4 (pending)

References

pypto#1937: push-based all_to_all proposal by @YunjiQin
ep_dispatch_combine/kernels/aiv/combine.cpp: existing TPUT usage pattern

Replace the 3-phase pull-based all_to_all kernel (stage-in → barrier → remote TLOAD) with a 2-phase push-based kernel (TPUT to peers → barrier → local copy-out). - Phase 1: TPUT each input chunk directly to the corresponding peer's scratch at offset my_rank*C via pto::comm::TPUT. Self-rank handled naturally — CommRemotePtr to self returns the local pointer. - Phase 2: unchanged notify/wait barrier. - Phase 3: purely local TLOAD from scratch → TSTORE to output (no CommRemotePtr needed here — scratch already holds the result). - Single pushTile reused for both push and copy-out phases (was two tiles: stageTile + recvTile). No post-exchange barrier needed (no WAR hazard — input read-only, scratch write-only from peers). - CPU_SIM fallback: manual element loop instead of TPUT. Scratch naming preserved for consistency with all other L3 examples (allreduce, allgather, broadcast, reduce_scatter). Sim-verified: P=2 and P=4 golden checks pass (0.000e+00 max diff).

coderabbitai · 2026-07-03T11:56:17Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e8a5add3-afe3-49d0-9750-a3c428c4774a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The all-to-all distributed example is refactored from a symmetric 3-phase stage-in/exchange pattern to a symmetric 2-phase push-based pattern. The kernel now performs direct TPUT pushes into peer scratch, a barrier via signal notify/wait, and local copy-out. Documentation and comments across README files, orchestration shim, and main.py are updated accordingly.

Changes

2-Phase Push Refactor

Layer / File(s)	Summary
Kernel push and copy-out implementation `examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp`	Phase 1 replaces staged `TLOAD`/`TSTORE` with event flags by direct per-destination `TPUT` pushes via `CommRemotePtr` into peer scratch; Phase 3 replaces remote-read exchange with local copy-out from scratch using a reused `pushTile`.
Orchestration and documentation updates `examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp`, `examples/workers/l3/all_to_all_distributed/README.md`, `examples/workers/l3/README.md`, `examples/workers/l3/all_to_all_distributed/main.py`	Comments and documentation are updated to describe the push-based 2-phase (push, barrier, copy-out) pattern in place of the prior 3-phase mesh/staging description, including scratch buffer semantics.

Estimated code review effort: 2 (Simple) | ~12 minutes

Possibly related PRs

hw-native-sys/simpler#888: Prior implementation of the same all-to-all kernel and orchestration codepaths using the 3-phase stage/exchange approach that this PR replaces.

Poem

A rabbit push, no need to wait,
Straight to your burrow, chunk by chunk, direct fate. 🐇
Barrier bells ring, all ranks in tune,
Then copy-out home beneath the moon.
Two phases now where three once stood—
Simpler warrens, understood!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly summarizes the main change from pull-based to push-based all_to_all lowering.
Description check	✅ Passed	The description is directly related to the changeset and accurately describes the push-based all_to_all rewrite.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request refactors the distributed all-to-all example to use a more efficient 2-phase push-based algorithm (push, barrier, and local copy-out) instead of the previous 3-phase pattern. Feedback on these changes includes fixing a markdown formatting typo in the README and optimizing the kernel loop by hoisting a loop-invariant pointer calculation outside of the loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/workers/l3/all_to_all_distributed/README.md (1)
39-43: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Malformed code fence and stale kernel description.

Line 39 concatenates the fence language tag with content: textpush, barrier, copy-out instead of ```` text ```` on its own line — this breaks the code fence/language identifier. It looks like push, barrier, copy-out was intended to replace the stale `(stage-in, barrier, exchange)` description of the kernel on Line 43, which still describes the old 3-phase design.
📝 Proposed fix
-```textpush, barrier, copy-out
+```text
 all_to_all_distributed/
 ├── kernels/
 │   ├── aiv/
-│   │   └── all_to_all_kernel.cpp     # AIV kernel (stage-in, barrier, exchange)
+│   │   └── all_to_all_kernel.cpp     # AIV kernel (push, barrier, copy-out)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/workers/l3/all_to_all_distributed/README.md` around lines 39 - 43,
Fix the malformed fenced block in the all_to_all_distributed README and update
the stale kernel phase description. In the README’s tree snippet, separate the
code fence language tag from the content so the fence starts correctly as a
standalone text block, and update the all_to_all_kernel.cpp comment to use the
current “push, barrier, copy-out” wording instead of the old “stage-in, barrier,
exchange” description. Use the README tree section and the all_to_all_kernel.cpp
entry as the anchors when editing.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@examples/workers/l3/all_to_all_distributed/README.md`:
- Around line 39-43: Fix the malformed fenced block in the
all_to_all_distributed README and update the stale kernel phase description. In
the README’s tree snippet, separate the code fence language tag from the content
so the fence starts correctly as a standalone text block, and update the
all_to_all_kernel.cpp comment to use the current “push, barrier, copy-out”
wording instead of the old “stage-in, barrier, exchange” description. Use the
README tree section and the all_to_all_kernel.cpp entry as the anchors when
editing.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e8bf3e22-0c52-406a-b6ea-86964d314f96

📥 Commits

Reviewing files that changed from the base of the PR and between fb65fe9 and 301fcde.

📒 Files selected for processing (5)

examples/workers/l3/README.md
examples/workers/l3/all_to_all_distributed/README.md
examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp
examples/workers/l3/all_to_all_distributed/kernels/orchestration/all_to_all_orch.cpp
examples/workers/l3/all_to_all_distributed/main.py

The pto-isa CPU simulator now has native TPUT support (TPut.hpp, via Copy_Data wrapper). The manual element-loop fallbacks guarded by __CPU_SIM are no longer needed — pto::comm::TPUT works on both hardware and simulator paths. - all_to_all_kernel.cpp: 1 guard removed - combine.cpp: 1 guard removed - dispatch.cpp: 3 guards removed - kernel_allreduce_sum.cpp: 1 guard removed Sim-verified: all_to_all (P=2, P=4), ffn_tp_parallel all pass. ep_dispatch_combine golden check is a pre-existing failure (fails on main too).

- Fix markdown code block language specifier typo in README.md (textpush,barrier,copy-out -> text) - Hoist loop-invariant scratch_dst_local pointer outside the Phase 1 loop (gemini-code-assist review)

gemini-code-assist Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread examples/workers/l3/all_to_all_distributed/README.md Outdated

Comment thread examples/workers/l3/all_to_all_distributed/kernels/aiv/all_to_all_kernel.cpp

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

georgebisbas changed the title ~~Update: convert all_to_all distributed to push-based lowering~~ update(distributed): convert all_to_all distributed to push-based lowering Jul 3, 2026

fix(pr): resolve review comments for hw-native-sys#1267

d582ae2

- Fix markdown code block language specifier typo in README.md (textpush,barrier,copy-out -> text) - Hoist loop-invariant scratch_dst_local pointer outside the Phase 1 loop (gemini-code-assist review)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

update(distributed): convert all_to_all distributed to push-based lowering#1267

update(distributed): convert all_to_all distributed to push-based lowering#1267
georgebisbas wants to merge 3 commits into
hw-native-sys:mainfrom
georgebisbas:perf/all-to-all-push-based

georgebisbas commented Jul 3, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

georgebisbas commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benefits

What changed

How it works

Verification

References

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

georgebisbas commented Jul 3, 2026 •

edited

Loading

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading