
Guard large-head nonpad Attention MEA dispatch #29140

Open
Kevin-Li-2025 wants to merge 1 commit into microsoft:main from Kevin-Li-2025:kevin/guard-large-head-nonpad-mea

Conversation

@Kevin-Li-2025


Description

Fixes #28388.

The ONNX Attention CUDA path currently allows Memory Efficient Attention (MEA) for the nonpad_kv_seqlen external-cache path with large head sizes. That path uses the CUTLASS custom right-padding variant, which for head_size > 256 can exceed the dynamic shared-memory opt-in limit on smaller GPU architectures and crash instead of falling back.

This change keeps MEA available for the normal path, but makes the case nonpad_kv_seqlen != nullptr && head_size > 256 fall through to the unified unfused path, which already supports large head sizes.
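
As a rough sketch of the dispatch rule described above (not the actual patch): the names `AttentionKernel`, `SelectKernel`, `mea_eligible`, and `kMaxNonpadMeaHeadSize` are hypothetical and are not ONNX Runtime symbols. Only the condition on nonpad_kv_seqlen and head_size comes from this description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; not ONNX Runtime symbols.
enum class AttentionKernel { kMemoryEfficient, kUnfused };

// Threshold taken from the description: head_size > 256 is "large".
constexpr int kMaxNonpadMeaHeadSize = 256;

// mea_eligible stands in for whatever existing checks decide that MEA
// could be used at all (an assumption of this sketch).
inline AttentionKernel SelectKernel(bool mea_eligible,
                                    const std::int64_t* nonpad_kv_seqlen,
                                    int head_size) {
  assert(head_size > 0);

  // The guarded case: external-cache path combined with a large head size.
  const bool large_head_nonpad =
      nonpad_kv_seqlen != nullptr && head_size > kMaxNonpadMeaHeadSize;

  if (mea_eligible && !large_head_nonpad) {
    return AttentionKernel::kMemoryEfficient;
  }
  // Everything else falls through to the unfused path.
  return AttentionKernel::kUnfused;
}
```

In this sketch, a large head size without nonpad_kv_seqlen still selects MEA, which mirrors the intent of keeping the normal path unchanged.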

Tests

  • python3 -m py_compile onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py
  • git diff --check

I also attempted the targeted pytest locally:

python3 -m pytest -q onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py -k large_head_nonpad_seqlen_falls_back_from_mea_fp16

but local collection is blocked by a missing parameterized package before reaching ORT/CUDA execution.

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
@Kevin-Li-2025

Author

@microsoft-github-policy-service agree



Development

Successfully merging this pull request may close these issues.

ONNX Attention MEA crashes with nonpad_kv_seqlen and head_size > 256

1 participant