Skip to content

Normalize TPUSH/TPOP usage and add interim onboard lane-offset bridge#1265

Open
yanghaoran29 wants to merge 1 commit into
hw-native-sys:mainfrom
yanghaoran29:TPUSH
Open

Normalize TPUSH/TPOP usage and add interim onboard lane-offset bridge#1265
yanghaoran29 wants to merge 1 commit into
hw-native-sys:mainfrom
yanghaoran29:TPUSH

Conversation

@yanghaoran29

Copy link
Copy Markdown
Contributor

Converge MIX GM FIFO tile-pipe kernels to pto-isa default semantics:

  • Remove redundant manual prod.record() and setRecordStatus(false) / setAllocateStatus(false) / setFreeStatus(false) from spmd_paged_attention
  • Rely on built-in TPUSH/TPOP/TFREE record and back-pressure; keep existing set_flag/wait_flag pipeline sync unchanged
  • Fix bgemm AIV GM accumulator row addressing via get_sub_block_id(args) instead of stale CCE get_subblockid()

Work around a known pinned pto-isa bug on 1C2V MIX where get_subblockid() returns 0 for both AIV lanes, breaking TILE_UP_DOWN lane offsets inside the library. Until pto-isa/dispatch fixes sub-block identity, inject GlobalContext.sub_block_id through intrinsic.h tpipe_lane_byte_offset() + setEntryOffset as an interim bridge.

Docs and validation:

  • Add docs/tpush-tpop.md; update aicore-kernel-programming.md
  • Drop spmd_paged_attention pytest skip
  • Onboard SmallCase1 / Case2 / Case1 golden checks pass

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9c35a27b-cab4-4080-a456-a5d37b861728

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Documentation is added/updated to recommend direct TPUSH/TPOP/TFREE usage and describe a get_subblockid() lane-offset workaround. A new tpipe_lane_byte_offset helper is added to two intrinsic headers and applied in kernel_bgemm.cpp and the spmd_paged_attention test kernel, which also drops manual pipe recording/status control and re-enables a previously skipped test.

Changes

TPUSH/TPOP guidance and lane-offset helper

Layer / File(s) Summary
TPUSH/TPOP documentation and guidance
docs/tpush-tpop.md, CLAUDE.md, docs/aicore-kernel-programming.md, examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/README.md
New doc added and existing docs updated to recommend direct TPUSH/TPOP/TFREE usage, document the get_subblockid() mismatch workaround, and cross-link to the new guidance.
Lane-offset helper
src/a2a3/.../intrinsic.h, src/a5/.../intrinsic.h
Adds tpipe_lane_byte_offset inline helper computing per-lane byte offsets from get_sub_block_id(args).
kernel_bgemm.cpp sub-block-id update
examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp
Changes __aicore__ macro default and switches subBlockIdx computation to get_sub_block_id(args).
spmd_paged_attention refactor and test re-enable
tests/st/a2a3/.../paged_attention_parallel.cpp, tests/st/a2a3/.../test_spmd_paged_attention.py
Removes manual pipe record() calls and status toggles, adopts tpipe_lane_byte_offset(...) for entry offsets, and removes the test skip marker.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Possibly related PRs

  • hw-native-sys/simpler#1155: Both PRs update docs/aicore-kernel-programming.md with the same pto-isa no-arg get_subblockid()/MIX lane-splitting pitfalls and loader .rela.text rationale.

Poem

A rabbit hops through pipes so fine,
TPUSH, TPOP — a cleaner line 🐇
No more manual records to keep,
Lane offsets computed while I sleep,
Skipped test awakes, hops back in the race,
Docs now guide with a tidier pace! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main changes: TPUSH/TPOP normalization and the onboard lane-offset bridge work-around.
Description check ✅ Passed The description matches the changeset and accurately describes the kernel updates, docs, and validation work.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request establishes recommended guidelines for using GM FIFO tile-pipes (TPUSH/TPOP/TFREE) and refactors several kernels and runtime files to align with these patterns, including removing manual record/back-pressure calls and introducing the tpipe_lane_byte_offset helper. The review feedback highlights several opportunities to prevent potential out-of-bounds memory accesses and signed integer overflows by validating that retrieved sub-block IDs and size parameters are non-negative before using them in offset calculations or casting them to unsigned types.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/common/intrinsic.h Outdated
Comment thread examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp (1)

164-187: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add a lane offset to the vector pipe before TPOP get_sub_block_id(args) fixes the GM row selection, but TPOP<..., TileSplitAxis::TILE_UP_DOWN> still depends on the pipe’s sub-block ID. On MIX, get_subblockid() stays 0 for both AIV lanes, so lane 1 would read the same FIFO half as lane 0; this path needs the same mPipe.cons.setEntryOffset(tpipe_lane_byte_offset(args, VEC_M, N, sizeof(float))) treatment used elsewhere.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp`
around lines 164 - 187, The vector FIFO pop path in kernel_bgemm.cpp still uses
the default pipe lane mapping, so both MIX lanes can read the same half of the
FIFO even though c_sub is correctly offset by get_sub_block_id(args). Update the
vec pipe setup around the TPOP call in the kernel_bgemm flow to apply the same
lane-specific entry offset used elsewhere, using
mPipe.cons.setEntryOffset(tpipe_lane_byte_offset(args, VEC_M, N, sizeof(float)))
before TPOP, and keep the fix scoped to the VecFifoTileT /
TileSplitAxis::TILE_UP_DOWN path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp`:
- Around line 164-187: The vector FIFO pop path in kernel_bgemm.cpp still uses
the default pipe lane mapping, so both MIX lanes can read the same half of the
FIFO even though c_sub is correctly offset by get_sub_block_id(args). Update the
vec pipe setup around the TPOP call in the kernel_bgemm flow to apply the same
lane-specific entry offset used elsewhere, using
mPipe.cons.setEntryOffset(tpipe_lane_byte_offset(args, VEC_M, N, sizeof(float)))
before TPOP, and keep the fix scoped to the VecFifoTileT /
TileSplitAxis::TILE_UP_DOWN path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 302db7f3-b8ba-40f6-b6d3-0b7a677f0b82

📥 Commits

Reviewing files that changed from the base of the PR and between fb65fe9 and 96a4d78.

📒 Files selected for processing (9)
  • CLAUDE.md
  • docs/aicore-kernel-programming.md
  • docs/tpush-tpop.md
  • examples/a2a3/tensormap_and_ringbuffer/qwen3_14b_decode/README.md
  • examples/a5/tensormap_and_ringbuffer/bgemm/kernels/mix/kernel_bgemm.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/common/intrinsic.h
  • src/a5/runtime/tensormap_and_ringbuffer/common/intrinsic.h
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/kernels/mix/paged_attention_parallel.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
💤 Files with no reviewable changes (1)
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py

@yanghaoran29 yanghaoran29 force-pushed the TPUSH branch 2 times, most recently from 667cd6c to c8525ca Compare July 3, 2026 10:33
Converge MIX GM FIFO tile-pipe kernels to pto-isa default semantics:

- Remove redundant manual prod.record() and setRecordStatus(false) /
  setAllocateStatus(false) / setFreeStatus(false) from spmd_paged_attention
- Rely on built-in TPUSH/TPOP/TFREE record and back-pressure; keep existing
  set_flag/wait_flag pipeline sync unchanged
- Fix bgemm AIV GM accumulator row addressing via get_sub_block_id(args)
  instead of stale CCE get_subblockid()

Work around a known pinned pto-isa bug on 1C2V MIX where get_subblockid()
returns 0 for both AIV lanes, breaking TILE_UP_DOWN lane offsets inside the
library. Until pto-isa/dispatch fixes sub-block identity, inject
GlobalContext.sub_block_id through intrinsic.h tpipe_lane_byte_offset() +
setEntryOffset as an interim bridge.

Docs and validation:
- Add docs/tpush-tpop.md; update aicore-kernel-programming.md
- Drop spmd_paged_attention pytest skip
- Onboard SmallCase1 / Case2 / Case1 golden checks pass
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant