Normalize TPUSH/TPOP usage and add interim onboard lane-offset bridge by yanghaoran29 · Pull Request #1265 · hw-native-sys/simpler

yanghaoran29 · 2026-07-03T09:40:18Z

Converge MIX GM FIFO tile-pipe kernels to pto-isa default semantics:

Remove redundant manual prod.record() and setRecordStatus(false) / setAllocateStatus(false) / setFreeStatus(false) from spmd_paged_attention
Rely on built-in TPUSH/TPOP/TFREE record and back-pressure; keep existing set_flag/wait_flag pipeline sync unchanged
Fix bgemm AIV GM accumulator row addressing via get_sub_block_id(args) instead of stale CCE get_subblockid()

Work around a known pinned pto-isa bug on 1C2V MIX where get_subblockid() returns 0 for both AIV lanes, breaking TILE_UP_DOWN lane offsets inside the library. Until pto-isa/dispatch fixes sub-block identity, inject GlobalContext.sub_block_id through intrinsic.h tpipe_lane_byte_offset() + setEntryOffset as an interim bridge.

Docs and validation:

Add docs/tpush-tpop.md; update aicore-kernel-programming.md
Drop spmd_paged_attention pytest skip
Onboard SmallCase1 / Case2 / Case1 golden checks pass

coderabbitai · 2026-07-03T09:40:58Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9c35a27b-cab4-4080-a456-a5d37b861728

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Documentation is added/updated to recommend direct TPUSH/TPOP/TFREE usage and describe a get_subblockid() lane-offset workaround. A new tpipe_lane_byte_offset helper is added to two intrinsic headers and applied in kernel_bgemm.cpp and the spmd_paged_attention test kernel, which also drops manual pipe recording/status control and re-enables a previously skipped test.

Changes

TPUSH/TPOP guidance and lane-offset helper

Layer / File(s)	Summary
TPUSH/TPOP documentation and guidance `docs/tpush-tpop.md`, `CLAUDE.md`, `docs/aicore-kernel-programming.md`, `examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/README.md`	New doc added and existing docs updated to recommend direct TPUSH/TPOP/TFREE usage, document the get_subblockid() mismatch workaround, and cross-link to the new guidance.
Lane-offset helper `src/a2a3/.../intrinsic.h`, `src/a5/.../intrinsic.h`	Adds `tpipe_lane_byte_offset` inline helper computing per-lane byte offsets from `get_sub_block_id(args)`.
kernel_bgemm.cpp sub-block-id update `examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp`	Changes `__aicore__` macro default and switches `subBlockIdx` computation to `get_sub_block_id(args)`.
spmd_paged_attention refactor and test re-enable `tests/st/a2a3/.../paged_attention_parallel.cpp`, `tests/st/a2a3/.../test_spmd_paged_attention.py`	Removes manual pipe `record()` calls and status toggles, adopts `tpipe_lane_byte_offset(...)` for entry offsets, and removes the test skip marker.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Possibly related PRs

hw-native-sys/simpler#1155: Both PRs update docs/aicore-kernel-programming.md with the same pto-isa no-arg get_subblockid()/MIX lane-splitting pitfalls and loader .rela.text rationale.

Poem

A rabbit hops through pipes so fine,
TPUSH, TPOP — a cleaner line 🐇
No more manual records to keep,
Lane offsets computed while I sleep,
Skipped test awakes, hops back in the race,
Docs now guide with a tidier pace! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: TPUSH/TPOP normalization and the onboard lane-offset bridge work-around.
Description check	✅ Passed	The description matches the changeset and accurately describes the kernel updates, docs, and validation work.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request establishes recommended guidelines for using GM FIFO tile-pipes (TPUSH/TPOP/TFREE) and refactors several kernels and runtime files to align with these patterns, including removing manual record/back-pressure calls and introducing the tpipe_lane_byte_offset helper. The review feedback highlights several opportunities to prevent potential out-of-bounds memory accesses and signed integer overflows by validating that retrieved sub-block IDs and size parameters are non-negative before using them in offset calculations or casting them to unsigned types.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp (1)
164-187: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add a lane offset to the vector pipe before TPOP get_sub_block_id(args) fixes the GM row selection, but TPOP<..., TileSplitAxis::TILE_UP_DOWN> still depends on the pipe’s sub-block ID. On MIX, get_subblockid() stays 0 for both AIV lanes, so lane 1 would read the same FIFO half as lane 0; this path needs the same mPipe.cons.setEntryOffset(tpipe_lane_byte_offset(args, VEC_M, N, sizeof(float))) treatment used elsewhere.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp`
around lines 164 - 187, The vector FIFO pop path in kernel_bgemm.cpp still uses
the default pipe lane mapping, so both MIX lanes can read the same half of the
FIFO even though c_sub is correctly offset by get_sub_block_id(args). Update the
vec pipe setup around the TPOP call in the kernel_bgemm flow to apply the same
lane-specific entry offset used elsewhere, using
mPipe.cons.setEntryOffset(tpipe_lane_byte_offset(args, VEC_M, N, sizeof(float)))
before TPOP, and keep the fix scoped to the VecFifoTileT /
TileSplitAxis::TILE_UP_DOWN path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp`:
- Around line 164-187: The vector FIFO pop path in kernel_bgemm.cpp still uses
the default pipe lane mapping, so both MIX lanes can read the same half of the
FIFO even though c_sub is correctly offset by get_sub_block_id(args). Update the
vec pipe setup around the TPOP call in the kernel_bgemm flow to apply the same
lane-specific entry offset used elsewhere, using
mPipe.cons.setEntryOffset(tpipe_lane_byte_offset(args, VEC_M, N, sizeof(float)))
before TPOP, and keep the fix scoped to the VecFifoTileT /
TileSplitAxis::TILE_UP_DOWN path.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 302db7f3-b8ba-40f6-b6d3-0b7a677f0b82

📥 Commits

Reviewing files that changed from the base of the PR and between fb65fe9 and 96a4d78.

📒 Files selected for processing (9)

CLAUDE.md
docs/aicore-kernel-programming.md
docs/tpush-tpop.md
examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/README.md
examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h
src/a5/runtime/tensormap_and_ringbuffer/common/intrinsic.h
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/kernels/mix/paged_attention_parallel.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py

💤 Files with no reviewable changes (1)

tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py

Converge MIX GM FIFO tile-pipe kernels to pto-isa default semantics: - Remove redundant manual prod.record() and setRecordStatus(false) / setAllocateStatus(false) / setFreeStatus(false) from spmd_paged_attention - Rely on built-in TPUSH/TPOP/TFREE record and back-pressure; keep existing set_flag/wait_flag pipeline sync unchanged - Fix bgemm AIV GM accumulator row addressing via get_sub_block_id(args) instead of stale CCE get_subblockid() Work around a known pinned pto-isa bug on 1C2V MIX where get_subblockid() returns 0 for both AIV lanes, breaking TILE_UP_DOWN lane offsets inside the library. Until pto-isa/dispatch fixes sub-block identity, inject GlobalContext.sub_block_id through intrinsic.h tpipe_lane_byte_offset() + setEntryOffset as an interim bridge. Docs and validation: - Add docs/tpush-tpop.md; update aicore-kernel-programming.md - Drop spmd_paged_attention pytest skip - Onboard SmallCase1 / Case2 / Case1 golden checks pass

gemini-code-assist Bot reviewed Jul 3, 2026

View reviewed changes

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

yanghaoran29 force-pushed the TPUSH branch 2 times, most recently from 667cd6c to c8525ca Compare July 3, 2026 10:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Normalize TPUSH/TPOP usage and add interim onboard lane-offset bridge#1265

Normalize TPUSH/TPOP usage and add interim onboard lane-offset bridge#1265
yanghaoran29 wants to merge 1 commit into
hw-native-sys:mainfrom
yanghaoran29:TPUSH

yanghaoran29 commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yanghaoran29 commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading