[None][feat] Qwen-Image: NVFP4 SVDQuant (NVFP4 residual + rank-r BF16 LoRA) by jingyu-ml · Pull Request #15532 · NVIDIA/TensorRT-LLM

jingyu-ml · 2026-06-23T00:06:13Z

What does this PR do?

Type of change: New feature

Adds NVFP4 SVDQuant support for Qwen-Image in VisualGen: running from a
ModelOpt SVDQuant checkpoint (quant_algo: NVFP4_SVD), where each quantized
linear is W ≈ R + L1·L2 — an NVFP4 residual R plus a per-input-channel
pre_quant_scale smoothing and a rank-r BF16 LoRA correction
(svdquant_lora_a = L2 [r,in], svdquant_lora_b = L1 [out,r]).
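
The decomposition above can be checked numerically. The following is a minimal sketch in plain Python, not the TensorRT-LLM implementation: it skips quantization entirely (`R` stays in full precision), the `matmul`, `transpose`, and `add` helpers are illustrative, and all values are made up. It assumes the factorization applies to the smoothed-space weight, as the formula in this description suggests, and shows that the residual GEMM plus the two skinny LoRA matmuls reproduce the dense product.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Shapes follow the PR description: in=3, out=2, rank r=1 (toy values).
R = [[1, 0, 2], [0, 1, -1]]      # residual, [out, in]
L1 = [[2], [-1]]                 # svdquant_lora_b, [out, r]
L2 = [[1, 3, 0]]                 # svdquant_lora_a, [r, in]
pre_quant_scale = [2, 1, 3]      # per-input-channel smoothing, [in]
X = [[1, 2, 3], [0, -1, 4]]      # activations, [tokens, in]

# X_hat = X * pre_quant_scale, broadcast over tokens.
X_hat = [[x * s for x, s in zip(row, pre_quant_scale)] for row in X]

# SVDQuant path: residual GEMM + (X_hat @ L2^T) @ L1^T.
residual_out = matmul(X_hat, transpose(R))
lora_out = matmul(matmul(X_hat, transpose(L2)), transpose(L1))
y_svdquant = add(residual_out, lora_out)

# Dense reference: X_hat @ (R + L1 @ L2)^T.
W_hat = add(R, matmul(L1, L2))
y_dense = matmul(X_hat, transpose(W_hat))

assert y_svdquant == y_dense
```

In the real path the residual term is an NVFP4 GEMM, so the two sides agree only up to quantization error; the integer toy values here make the identity exact.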

Stacked on #15470 (NVFP4 static-checkpoint loading). The first commit is
that PR; review commit 54507bf for the SVDQuant-only diff, and merge #15470
first. The SVDQuant residual reuses the NVFP4 static-load path, the exclude
handling, and the relaxed key check from #15470.

Changes (the SVDQuant commit):

- NVFP4SVDLinearMethod (transformer_qwen_image.py): `Y = nvfp4_gemm(quant(X̂), R) + (X̂ @ L2ᵀ) @ L1ᵀ + bias`, `X̂ = X·pre_quant_scale`. Subclasses the NVFP4 method for the residual; loads `pre_quant_scale` + the two LoRA factors per Linear.
- load_weights detects SVDQuant from svdquant_lora_a keys, swaps the method
  onto the quantized Linears, and relaxes the key check for the 3 extra tensors.
- config.py: NVFP4_SVD → NVFP4 in algo_map so the residual loads on the
  standard static-NVFP4 path. Excluded layers (embedders/proj_out/first+last
  blocks) stay BF16, same as NVFP4.
- New examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml; documented in
  visual-generation.md.
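
The detection and relaxed key check described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the function names, the `ALGO_MAP` dictionary, and the example key names are hypothetical, and whether the real check treats the three extra tensors as "unexpected" or "missing" keys is not stated in this description.

```python
# The three per-Linear tensors that an SVDQuant checkpoint adds.
SVDQUANT_EXTRA_SUFFIXES = ("pre_quant_scale", "svdquant_lora_a", "svdquant_lora_b")

# Hypothetical stand-in for the algo map: the SVD variant reuses the NVFP4 path.
ALGO_MAP = {"NVFP4": "NVFP4", "NVFP4_SVD": "NVFP4"}


def is_svdquant_checkpoint(weight_keys):
    """Return True if any checkpoint key names an SVDQuant LoRA-A factor."""
    return any(key.endswith("svdquant_lora_a") for key in weight_keys)


def unexpected_keys(weight_keys, expected_keys, relaxed_suffixes=()):
    """Checkpoint keys the model does not expect, ignoring relaxed suffixes."""
    relaxed = tuple(relaxed_suffixes)
    return sorted(
        key
        for key in weight_keys
        if key not in expected_keys and not key.endswith(relaxed)
    )


# Made-up key names for one quantized Linear.
prefix = "transformer_blocks.1.attn.to_q."
ckpt_keys = [prefix + name for name in ("weight",) + SVDQUANT_EXTRA_SUFFIXES]
expected = {prefix + "weight"}

strict = unexpected_keys(ckpt_keys, expected)
relaxed = unexpected_keys(ckpt_keys, expected, SVDQUANT_EXTRA_SUFFIXES)
print(is_svdquant_checkpoint(ckpt_keys), len(strict), relaxed)
```

A strict check flags all three extra tensors, while the relaxed check lets the checkpoint through so the SVDQuant method can consume them.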

This is the functional path (BF16 matmuls for the LoRA). A fused FlashInfer
SVDQuant kernel for the residual+LoRA is a follow-up perf optimization.

Usage

python examples/visual_gen/models/qwen_image.py --model <qwen-image-svdquant> \
    --visual_gen_args examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml

Output samples (1328×1328, 50 steps, seed 42, same prompt)

SVDQuant quality is on par with NVFP4 and BF16:

BF16	NVFP4	NVFP4 SVDQuant (this PR)

Testing

On 1× GB200 (sm_100), TRT-LLM release container: the SVDQuant checkpoint loads
(729/729 transformer weights, no key/shape errors) and renders a coherent
1328² image visually on par with BF16/NVFP4 at the same prompt/seed. (The
residual-only output would be visibly degraded without the LoRA, so the clean
result confirms the LoRA term is applied.)

Before your PR is "Ready for review"

Backward compatible: ✅ (additive; NVFP4/BF16/dynamic paths unchanged)
New tests: ⚠️ covered by end-to-end image validation; unit coverage TODO

Summary by CodeRabbit

Release Notes

New Features
- Added NVFP4 SVDQuant support for Qwen-Image with LoRA correction
- Added Qwen-Image text-to-image generation CLI example
- Added 1-GPU configuration files for BF16, NVFP4, and SVDQuant setups
Documentation
- Enhanced Qwen-Image documentation with quantization and loading details
- Added usage examples for Qwen-Image model configurations
Tests
- Added quantization exclusion behavior test

…oints

Enable VisualGen to run Qwen-Image from a statically pre-quantized ModelOpt checkpoint (NVFP4/FP8), and add the offline example + configs. Previously only dynamic quantization (BF16 -> NVFP4 at load) worked; pointing --model at a ModelOpt-exported NVFP4 checkpoint failed during weight loading.

transformer_qwen_image.py:
- Honor the checkpoint's quantization `ignore` list: clear quant_config on the excluded Linear modules before create_weights() so they build the unquantized method (ModelOpt stores those layers -- embedders, proj_out, norm_out, time_text_embed, first/last blocks -- in BF16). get_quant_method() selects the method purely from module.quant_config.
- Relax the strict weight-key check for FP8/NVFP4 helper buffers that are derived at load time and never serialized (alpha, inv_input_scale, kv_scales, inv_kv_scales).

Both changes are backward compatible with the dynamic-quant and BF16 paths.

Add examples/visual_gen/models/qwen_image.py and qwen-image-{fp4,bf16}-1gpu.yaml; document static-checkpoint support in README and visual-generation.md; add a unit test for the exclusion logic.

Validated on GB200 (sm_100): a static NVFP4 checkpoint loads (729/729 weights) and renders a 1328x1328 image on par with BF16; qwen registry tests pass (7/7).

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

… LoRA)

Builds on NVIDIA#15470 (NVFP4 static-checkpoint loading). Adds support for running Qwen-Image from a ModelOpt NVFP4 SVDQuant checkpoint (quant_algo NVFP4_SVD): W ~= R + L1.L2 with per-input-channel pre_quant_scale activation smoothing.

- NVFP4SVDLinearMethod (transformer_qwen_image.py): forward = NVFP4 residual GEMM on the smoothed activation Xhat = X * pre_quant_scale, plus a rank-r BF16 LoRA term (Xhat @ svdquant_lora_a^T) @ svdquant_lora_b^T; loads pre_quant_scale and the two LoRA factors per quantized Linear.
- load_weights detects SVDQuant from the checkpoint's svdquant_lora_a keys, swaps the method onto the quantized Linears, and relaxes the strict key check for the three extra tensors. The residual loads on the standard static-NVFP4 path (config.py maps NVFP4_SVD -> NVFP4); excluded layers stay BF16.
- Add examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml; document NVFP4_SVD support in visual-generation.md.

Functional path (BF16 matmuls for the LoRA); a fused FlashInfer SVDQuant kernel is a follow-up perf optimization.

Validated on GB200: the SVDQuant checkpoint loads (729/729 weights) and renders a 1328x1328 image on par with BF16/NVFP4 at the same prompt and seed.

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>

coderabbitai · 2026-06-23T00:12:51Z

📝 Walkthrough

Adds NVFP4 SVDQuant support to the Qwen-Image VisualGen model: introduces NVFP4SVDLinearMethod (LoRA residual correction over NVFP4), maps NVFP4_SVD in quant config parsing, wires excluded-layer clearing and SVDQuant detection into load_weights, and adds a new CLI example script, three YAML configs, and documentation updates.

Changes

Qwen-Image NVFP4 SVDQuant Support

Layer / File(s)	Summary
NVFP4SVDLinearMethod and quant config parsing `tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py`, `tensorrt_llm/_torch/visual_gen/config.py`	Adds `NVFP4LinearMethod` import and `NVFP4SVDLinearMethod` subclass with `create_weights`, `load_weights_vanilla` (loading `pre_quant_scale`, `svdquant_lora_a/b`), and `apply` (residual NVFP4 GEMM plus LoRA correction). Adds `"NVFP4_SVD"` → `QuantAlgo.NVFP4` mapping in `DiffusionPipelineConfig.load_diffusion_quant_config`.
QwenImageTransformer2DModel integration `tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py`	Adds `_clear_quant_config_on_excluded_layers()` to nullify `quant_config` on excluded `Linear` modules before weight creation. Updates `load_weights` to invoke this helper, skip derived-suffix keys in missing/unexpected-key validation, and post-load detect `svdquant_lora_a` keys to swap eligible modules' `quant_method` to `NVFP4SVDLinearMethod`.
Unit test for excluded-layer clearing `tests/unittest/_torch/visual_gen/test_qwen_image_registry.py`	Adds `test_static_quant_excludes_high_precision_layers`: instantiates `QwenImageTransformer2DModel` with `skip_create_weights_in_init=True`, calls `_clear_quant_config_on_excluded_layers()`, and asserts excluded submodules have `quant_config=None` while non-excluded submodules retain NVFP4 `quant_config`.
Example CLI, configs, and docs `examples/visual_gen/models/qwen_image.py`, `examples/visual_gen/configs/qwen-image-*.yaml`, `examples/visual_gen/README.md`, `docs/source/models/visual-generation.md`	Adds `qwen_image.py` CLI script with `_output_paths` helper and `main()` using `VisualGen`. Adds three 1-GPU YAML configs (BF16, NVFP4, SVDQuant). Updates examples README with Qwen-Image BF16/NVFP4 usage. Expands the `[^2]` visual-generation.md footnote with BF16 parity and ModelOpt checkpoint details.
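
The excluded-layer clearing summarized in the table above can be sketched as follows. This is a minimal illustration, not the PR's `_clear_quant_config_on_excluded_layers()`: the `ToyLinear` class, the free function, the module names, and the use of glob-style patterns via `fnmatch` are all assumptions made for the example.

```python
from fnmatch import fnmatch


class ToyLinear:
    """Stand-in for a Linear module that carries a quant_config attribute."""

    def __init__(self, quant_config):
        self.quant_config = quant_config


def clear_quant_config_on_excluded(named_linears, exclude_patterns):
    """Set quant_config=None on every module whose name matches a pattern."""
    cleared = []
    for name, module in named_linears.items():
        if any(fnmatch(name, pattern) for pattern in exclude_patterns):
            module.quant_config = None
            cleared.append(name)
    return cleared


# Hypothetical module names and exclusion patterns.
modules = {
    "proj_out": ToyLinear("NVFP4"),
    "transformer_blocks.0.attn.to_q": ToyLinear("NVFP4"),
    "transformer_blocks.1.attn.to_q": ToyLinear("NVFP4"),
}
cleared = clear_quant_config_on_excluded(modules, ["proj_out", "transformer_blocks.0.*"])
print(cleared)
```

Running the clearing step before weight creation matters because, per the commit message, the quantization method is selected purely from `module.quant_config`.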

Sequence Diagram(s)

sequenceDiagram
  participant Script as qwen_image.py
  participant VisualGen
  participant QwenImageTransformer2DModel
  participant NVFP4SVDLinearMethod

  Script->>VisualGen: VisualGen(model, visual_gen_args)
  Script->>VisualGen: generate(prompt, params)
  VisualGen->>QwenImageTransformer2DModel: load_weights(weights)
  QwenImageTransformer2DModel->>QwenImageTransformer2DModel: _clear_quant_config_on_excluded_layers()
  QwenImageTransformer2DModel->>QwenImageTransformer2DModel: create_weights() per Linear module
  QwenImageTransformer2DModel->>QwenImageTransformer2DModel: detect svdquant_lora_a keys → swap quant_method to NVFP4SVDLinearMethod
  VisualGen->>QwenImageTransformer2DModel: forward(x)
  QwenImageTransformer2DModel->>NVFP4SVDLinearMethod: apply(x, weight, bias)
  NVFP4SVDLinearMethod-->>QwenImageTransformer2DModel: residual_out + lora_correction
  QwenImageTransformer2DModel-->>VisualGen: image tensor
  VisualGen-->>Script: generated image
  Script->>Script: output.save(_output_paths(...))

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main feature: adding NVFP4 SVDQuant support for Qwen-Image with specific technical details about the residual and LoRA components.
Description check	✅ Passed	The PR description comprehensively covers what the change does, the technical approach, usage example, testing validation, and backward compatibility status, though the PR Checklist at the end is not fully completed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

examples/visual_gen/models/qwen_image.py (1)
42-42: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an explicit return type on main.

main should be annotated as -> None to match the repository’s Python typing rule.

As per coding guidelines, “Always annotate functions. Make the return type None if the function does not return anything.”
Suggested patch
-def main():
+def main() -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/visual_gen/models/qwen_image.py` at line 42, The main function is
missing an explicit return type annotation. Add the return type annotation ->
None to the main function signature to indicate that the function does not
return a value. This aligns with the repository's Python typing guidelines which
require all functions to have explicit return type annotations.
Source: Coding guidelines
tests/unittest/_torch/visual_gen/test_qwen_image_registry.py (1)
96-109: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Coverage is insufficient for the load_weights() integration contract.

Line 96 tests the helper directly, but it does not verify that QwenImageTransformer2DModel.load_weights() still invokes _clear_quant_config_on_excluded_layers() before weight creation. A call-order/wiring regression could pass this test.
Coverage status: insufficient in tests/unittest/_torch/visual_gen/test_qwen_image_registry.py. Please add a follow-up integration test in this file or tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py that exercises load_weights() and asserts excluded vs non-excluded Linear.quant_config outcomes.

As per path instructions, "Act as a QA engineer reviewing test changes and coverage ... suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/visual_gen/test_qwen_image_registry.py` around lines 96
- 109, The current test directly exercises the
_clear_quant_config_on_excluded_layers() helper method, but does not verify that
QwenImageTransformer2DModel.load_weights() integrates and invokes this method
before weight creation, leaving a potential regression gap. Add a follow-up
integration test that calls the full load_weights() method on a
QwenImageTransformer2DModel instance and asserts that excluded layer modules
(img_in, txt_in, proj_out, norm_out.linear, transformer_blocks attn.to_q,
img_mlp projections) have their quant_config cleared to None, while non-excluded
quantized blocks retain their QuantAlgo.NVFP4 configuration. This test can be
added in the current test file or in
tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py to verify the
end-to-end integration contract.
Source: Path instructions
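
One possible shape for the call-order check requested above, sketched with `unittest.mock` on a toy class. `ToyModel` and its `_create_weights` method are hypothetical stand-ins (per the walkthrough, the real model creates weights per Linear module), so this only illustrates the wiring assertion and is not a drop-in test for `QwenImageTransformer2DModel`.

```python
from unittest import mock


class ToyModel:
    """Stand-in for a model whose load_weights must clear excluded layers first."""

    def _clear_quant_config_on_excluded_layers(self):
        pass

    def _create_weights(self):  # hypothetical name for the weight-creation step
        pass

    def load_weights(self, weights):
        self._clear_quant_config_on_excluded_layers()
        self._create_weights()


def recorded_call_order(model, weights):
    """Run load_weights with both steps mocked and return the order of calls."""
    parent = mock.Mock()
    with mock.patch.object(model, "_clear_quant_config_on_excluded_layers", parent.clear), \
            mock.patch.object(model, "_create_weights", parent.create):
        model.load_weights(weights)
    # Each entry in mock_calls is a (name, args, kwargs) tuple.
    return [call[0] for call in parent.mock_calls]


order = recorded_call_order(ToyModel(), weights={})
assert order == ["clear", "create"]
```

Attaching both mocks to one parent keeps the calls in a single ordered list, so a regression that swaps or drops the clearing step fails the assertion.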

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/visual_gen/models/qwen_image.py`:
- Around line 1-98: The qwen_image.py file has formatting issues detected by
ruff-format that are blocking CI. Run the ruff formatter on this file to
automatically fix formatting violations, review the changes to ensure they are
correct, and commit the formatter output to your branch before merge.

In `@examples/visual_gen/README.md`:
- Around line 33-35: The README currently documents how to use the NVFP4
quantized model with a specific command, but does not include documentation for
the new SVDQuant quantization option that was added in this PR. Add a new
section or command block after the NVFP4 documentation that shows users how to
run the SVDQuant variant using the configs/qwen-image-svdquant-1gpu.yaml
configuration file, following the same format and structure as the existing
NVFP4 command to ensure consistency and discoverability.

In `@tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py`:
- Around line 95-99: The svdquant_lora_a and svdquant_lora_b parameters are
being loaded directly from the checkpoint without accounting for tensor
parallelism sharding requirements. For tensor-parallel linears, svdquant_lora_a
must be sharded along the input dimension during row TP, and svdquant_lora_b
must be sharded along the output dimension during column TP. Apply the same
sharding pattern used for the base NVFP4 factors to both svdquant_lora_a and
svdquant_lora_b before assigning them as module parameters in the conditional
block where "svdquant_lora_a" is in w, and also in the similar block mentioned
at lines 115-116. This ensures the LoRA matrices align with the module's local
in_features and out_features under tensor parallelism.
- Around line 65-117: The NVFP4SVDLinearMethod class inherits
supports_nccl_symmetric_memory_window_output=True from its parent, but the apply
method modifies the output tensor by adding LoRA correction (line 116: out +
lora), which breaks the assumption that the returned tensor is the original
NCCL-window buffer expected by the symmetric-memory all-reduce path. Override
the supports_nccl_symmetric_memory_window_output class attribute to False in
NVFP4SVDLinearMethod to disable NCCL-window output support for this method,
since the output tensor is no longer the expected window buffer after the LoRA
addition.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 446d64dd-5235-4ae7-b5d4-263ac3e96b3a

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed7ce4 and 54507bf.

📒 Files selected for processing (9)

docs/source/models/visual-generation.md
examples/visual_gen/README.md
examples/visual_gen/configs/qwen-image-bf16-1gpu.yaml
examples/visual_gen/configs/qwen-image-fp4-1gpu.yaml
examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml
examples/visual_gen/models/qwen_image.py
tensorrt_llm/_torch/visual_gen/config.py
tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py
tests/unittest/_torch/visual_gen/test_qwen_image_registry.py

coderabbitai · 2026-06-23T00:12:54Z

+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Qwen-Image text-to-image generation.
+
+Usage:
+    # BF16 reference (HF Hub id or local diffusers checkpoint)
+    python qwen_image.py --model Qwen/Qwen-Image
+
+    # NVFP4 (ModelOpt pre-quantized checkpoint; quantization is read from the
+    # checkpoint's transformer/config.json)
+    python qwen_image.py --model <qwen-image-nvfp4> \
+        --visual_gen_args ../configs/qwen-image-fp4-1gpu.yaml
+"""
+
+import argparse
+from pathlib import Path
+
+from tensorrt_llm import VisualGen, VisualGenArgs
+
+
+def _output_paths(output_path: str, num_images: int) -> str | list[str]:
+    if num_images == 1:
+        return output_path
+
+    path = Path(output_path)
+    return [str(path.with_name(f"{path.stem}_{idx + 1}{path.suffix}")) for idx in range(num_images)]
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Qwen-Image Text-to-Image example")
+    parser.add_argument(
+        "--model",
+        type=str,
+        default="Qwen/Qwen-Image",
+        help="Model path or HuggingFace Hub ID (BF16 base or a ModelOpt-quantized checkpoint)",
+    )
+    parser.add_argument(
+        "--visual_gen_args",
+        dest="visual_gen_args",
+        type=str,
+        default=None,
+        help="Path to YAML config (same as trtllm-serve --visual_gen_args)",
+    )
+    parser.add_argument(
+        "--prompt",
+        type=str,
+        default=(
+            "A coffee shop entrance features a chalkboard sign reading "
+            '"Qwen Coffee, $2 per cup," with a neon light beside it displaying '
+            "a steaming coffee cup, photorealistic, highly detailed"),
+        help="Text prompt for image generation",
+    )
+    parser.add_argument(
+        "--num_images_per_prompt",
+        type=int,
+        default=1,
+        help="Number of images to generate for the prompt",
+    )
+    parser.add_argument(
+        "--output_path",
+        type=str,
+        default="qwen_image_output.png",
+        help="Path to save the output image. For multiple images, an index is appended.",
+    )
+    args = parser.parse_args()
+    if args.num_images_per_prompt < 1:
+        raise ValueError("--num_images_per_prompt must be >= 1")
+
+    # Engine config from shared YAML (optional); model-specific defaults apply otherwise.
+    extra_args = VisualGenArgs.from_yaml(args.visual_gen_args) if args.visual_gen_args else None
+    visual_gen = VisualGen(model=args.model, args=extra_args)
+
+    # --- Model-specific: T2I request construction ---
+    # Start from per-model defaults (resolution, steps, guidance, seed, etc.) and set image count.
+    params = visual_gen.default_params
+    params.num_images_per_prompt = args.num_images_per_prompt
+
+    output = visual_gen.generate(inputs=args.prompt, params=params)
+
+    saved = output.save(_output_paths(args.output_path, args.num_images_per_prompt))
+    print(f"Saved: {saved}")
+
+
+if __name__ == "__main__":
+    main()
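
For reference, the `_output_paths` helper in the diff above can be exercised on its own. The body below is copied from the diff; the only change is an assumption for portability, namely that the return annotation uses `typing.Union` so the snippet also runs on Python versions older than 3.10.

```python
from pathlib import Path
from typing import List, Union


def _output_paths(output_path: str, num_images: int) -> Union[str, List[str]]:
    # A single image keeps the path unchanged.
    if num_images == 1:
        return output_path

    # Multiple images get a 1-based index appended to the file stem.
    path = Path(output_path)
    return [str(path.with_name(f"{path.stem}_{idx + 1}{path.suffix}")) for idx in range(num_images)]


single = _output_paths("qwen_image_output.png", 1)
several = _output_paths("qwen_image_output.png", 3)
print(single)   # qwen_image_output.png
print(several)  # ['qwen_image_output_1.png', 'qwen_image_output_2.png', 'qwen_image_output_3.png']
```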


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Commit the formatter output to unblock CI.

Pre-commit is failing on ruff-format; this PR needs the formatter changes committed before merge.

🧰 Tools

🪛 GitHub Actions: Release Checks / 0_Pre-commit Check.txt

[error] 1-1: pre-commit hook 'ruff-format' failed (files reformatted). 2 files were reformatted by this hook; commit should include the formatting changes.

🪛 GitHub Actions: Release Checks / Pre-commit Check

[error] 1-1: pre-commit hook failed: ruff-format. The file was reformatted by ruff-format (2 files reformatted total). Commit the formatting changes or run ruff format.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/visual_gen/models/qwen_image.py` around lines 1 - 98, The qwen_image.py file has formatting issues detected by ruff-format that are blocking CI. Run the ruff formatter on this file to automatically fix formatting violations, review the changes to ensure they are correct, and commit the formatter output to your branch before merge.

Source: Pipeline failures

coderabbitai · 2026-06-23T00:12:54Z

+# Qwen-Image NVFP4: point --model at a ModelOpt-quantized checkpoint; the NVFP4
+# config is read from the checkpoint. (Use a BF16 checkpoint for the baseline.)
+python models/qwen_image.py --model <qwen-image-nvfp4> --visual_gen_args configs/qwen-image-fp4-1gpu.yaml


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add a matching SVDQuant usage command in the README.

This section documents BF16 + NVFP4, but the new SVDQuant path added in this PR is not shown. Please add a concrete command for configs/qwen-image-svdquant-1gpu.yaml so users can discover and run it directly.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/visual_gen/README.md` around lines 33 - 35, The README currently documents how to use the NVFP4 quantized model with a specific command, but does not include documentation for the new SVDQuant quantization option that was added in this PR. Add a new section or command block after the NVFP4 documentation that shows users how to run the SVDQuant variant using the configs/qwen-image-svdquant-1gpu.yaml configuration file, following the same format and structure as the existing NVFP4 command to ensure consistency and discoverability.

coderabbitai · 2026-06-23T00:12:54Z

+class NVFP4SVDLinearMethod(NVFP4LinearMethod):
+    """SVDQuant: NVFP4 residual GEMM + rank-r BF16 LoRA correction.
+
+    ModelOpt SVDQuant factorizes ``W ≈ R + L1·L2`` with per-input-channel
+    activation smoothing ``s`` (``pre_quant_scale``). With ``X̂ = X · s``, the
+    smoothed-space residual ``R`` (NVFP4) and low-rank term give::
+
+        Y = nvfp4_gemm(quant(X̂), R) · scales + (X̂ @ L2ᵀ) @ L1ᵀ  [+ bias]
+
+    where ``svdquant_lora_a`` = L2 ``[r, in]`` and ``svdquant_lora_b`` = L1
+    ``[out, r]``. The NVFP4 residual reuses the base method; this subclass adds
+    the smoothing + LoRA correction. Functional path (BF16 matmuls for the
+    LoRA); the fused FlashInfer SVDQuant kernel is a separate perf optimization.
+    """
+
+    def create_weights(self, module, in_features, out_features, bias, dtype):
+        super().create_weights(module, in_features, out_features, bias, dtype)
+        # Materialized lazily in load_weights_vanilla (rank comes from the ckpt).
+        module.svdquant_lora_a = None
+        module.svdquant_lora_b = None
+
+    def load_weights_vanilla(self, module, weights, allow_partial_loading: bool = False) -> None:
+        super().load_weights_vanilla(module, weights, allow_partial_loading)
+        w = weights[0]
+        device = module.weight.device
+        # pre_quant_scale ([in_features]) may already be loaded by the base NVFP4
+        # method on newer releases; load it here too for robustness.
+        if getattr(module, "pre_quant_scale", None) is None and "pre_quant_scale" in w:
+            module.pre_quant_scale = nn.Parameter(
+                w["pre_quant_scale"].to(device), requires_grad=False)
+        if "svdquant_lora_a" in w:
+            module.svdquant_lora_a = nn.Parameter(
+                w["svdquant_lora_a"].to(device), requires_grad=False)
+            module.svdquant_lora_b = nn.Parameter(
+                w["svdquant_lora_b"].to(device), requires_grad=False)
+
+    def apply(self, module, input, bias):
+        pqs = getattr(module, "pre_quant_scale", None)
+        x_hat = input * pqs if pqs is not None else input
+        # Residual NVFP4 GEMM on the already-smoothed activation; clear
+        # pre_quant_scale so the base method does not smooth a second time.
+        saved = getattr(module, "pre_quant_scale", None)
+        module.pre_quant_scale = None
+        try:
+            out = super().apply(module, x_hat, bias)
+        finally:
+            module.pre_quant_scale = saved
+        a = getattr(module, "svdquant_lora_a", None)
+        b = getattr(module, "svdquant_lora_b", None)
+        if a is not None and b is not None:
+            lora = torch.matmul(torch.matmul(x_hat, a.t()), b.t())
+            out = out + lora.to(out.dtype)
+        return out


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Disable NCCL-window output for SVDQuant linears.

NVFP4SVDLinearMethod inherits supports_nccl_symmetric_memory_window_output=True, but Line 116 replaces the GEMM output with out + lora, so the returned tensor is no longer the NCCL-window buffer that Linear.forward() expects on the symmetric-memory all-reduce path. Override the class flag to False for this method, or fuse the LoRA add into the window buffer before returning.

Proposed fix

 class NVFP4SVDLinearMethod(NVFP4LinearMethod):
+    supports_nccl_symmetric_memory_window_output = False
+
     """SVDQuant: NVFP4 residual GEMM + rank-r BF16 LoRA correction.

🧰 Tools

🪛 Ruff (0.15.18)

[error] 101-101: Function argument input is shadowing a Python builtin

(A002)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py` around lines 65 - 117, The NVFP4SVDLinearMethod class inherits supports_nccl_symmetric_memory_window_output=True from its parent, but the apply method modifies the output tensor by adding LoRA correction (line 116: out + lora), which breaks the assumption that the returned tensor is the original NCCL-window buffer expected by the symmetric-memory all-reduce path. Override the supports_nccl_symmetric_memory_window_output class attribute to False in NVFP4SVDLinearMethod to disable NCCL-window output support for this method, since the output tensor is no longer the expected window buffer after the LoRA addition.
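
The concern above comes down to buffer identity: an out-of-place add returns a new object, so a caller that expects the original preallocated buffer no longer gets it. The toy sketch below illustrates that and the class-flag override; `Buffer`, the method classes, and the `supports_window_output` flag name are hypothetical stand-ins, not TensorRT-LLM types.

```python
class Buffer:
    """Toy stand-in for a tensor backed by a preallocated communication buffer."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Out-of-place: allocates a new Buffer, like `out + lora` on tensors.
        return Buffer(a + b for a, b in zip(self.values, other.values))

    def add_(self, other):
        # In-place: writes into the existing storage and returns self.
        for i, b in enumerate(other.values):
            self.values[i] += b
        return self


class BaseMethod:
    supports_window_output = True  # hypothetical flag name for this sketch

    def apply(self, window, lora):
        return window


class OutOfPlaceMethod(BaseMethod):
    # The result is a fresh object, so the subclass opts out of the fast path.
    supports_window_output = False

    def apply(self, window, lora):
        return window + lora


class InPlaceMethod(BaseMethod):
    # The result is the same object as the window, so the flag can stay True.
    def apply(self, window, lora):
        return window.add_(lora)


window_a, window_b = Buffer([1, 2]), Buffer([1, 2])
lora = Buffer([10, 20])
out_a = OutOfPlaceMethod().apply(window_a, lora)
out_b = InPlaceMethod().apply(window_b, lora)
print(out_a is window_a, out_b is window_b)  # False True
```

Both options from the review map onto this: either clear the flag (as in `OutOfPlaceMethod`) or accumulate the correction into the existing buffer (as in `InPlaceMethod`).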

coderabbitai · 2026-06-23T00:12:54Z

+        if "svdquant_lora_a" in w:
+            module.svdquant_lora_a = nn.Parameter(
+                w["svdquant_lora_a"].to(device), requires_grad=False)
+            module.svdquant_lora_b = nn.Parameter(
+                w["svdquant_lora_b"].to(device), requires_grad=False)


🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Shard SVDQuant LoRA factors for tensor-parallel linears.

The LoRA tensors are loaded directly from the checkpoint, but Linear stores local in_features/out_features under row/column TP. For row TP, svdquant_lora_a must be sharded on the input dimension; for column TP, svdquant_lora_b must be sharded on the output dimension. Otherwise the LoRA matmul can either shape-mismatch or add a global-output correction to a local residual. Please mirror the base NVFP4 sharding pattern when loading these factors.

Also applies to: 115-116

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py` around lines 95 - 99, The svdquant_lora_a and svdquant_lora_b parameters are being loaded directly from the checkpoint without accounting for tensor parallelism sharding requirements. For tensor-parallel linears, svdquant_lora_a must be sharded along the input dimension during row TP, and svdquant_lora_b must be sharded along the output dimension during column TP. Apply the same sharding pattern used for the base NVFP4 factors to both svdquant_lora_a and svdquant_lora_b before assigning them as module parameters in the conditional block where "svdquant_lora_a" is in w, and also in the similar block mentioned at lines 115-116. This ensures the LoRA matrices align with the module's local in_features and out_features under tensor parallelism.
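
The sharding rule in the comment above can be verified on toy data. The sketch below uses plain Python lists and simulates two ranks in a single process, with concatenation standing in for an all-gather and summation for an all-reduce; the helper names and values are illustrative assumptions, not the sharding code of the base NVFP4 method.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


L1 = [[1, 2], [0, 1], [3, -1], [2, 2]]   # svdquant_lora_b, [out=4, r=2]
L2 = [[1, 0, 2, 1], [0, 1, -1, 3]]       # svdquant_lora_a, [r=2, in=4]
x_hat = [[1, 2, 3, 4]]                   # smoothed activation, [tokens=1, in=4]
tp_size = 2

# Unsharded reference: (x_hat @ L2^T) @ L1^T.
full = matmul(matmul(x_hat, transpose(L2)), transpose(L1))

# Column TP: each rank keeps a slice of L1 along the output dimension.
out_per_rank = len(L1) // tp_size
col_parts = []
for rank in range(tp_size):
    L1_local = L1[rank * out_per_rank:(rank + 1) * out_per_rank]
    col_parts.append(matmul(matmul(x_hat, transpose(L2)), transpose(L1_local)))
# Concatenate the local outputs per token (stands in for an all-gather).
col_tp = [sum((part[t] for part in col_parts), []) for t in range(len(x_hat))]

# Row TP: each rank keeps a slice of L2 along the input dimension and sees
# the matching slice of the activation.
in_per_rank = len(L2[0]) // tp_size
row_tp = [[0] * len(L1) for _ in x_hat]
for rank in range(tp_size):
    lo, hi = rank * in_per_rank, (rank + 1) * in_per_rank
    L2_local = [row[lo:hi] for row in L2]
    x_local = [row[lo:hi] for row in x_hat]
    partial = matmul(matmul(x_local, transpose(L2_local)), transpose(L1))
    # Sum the partial outputs across ranks (stands in for an all-reduce).
    row_tp = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(row_tp, partial)]

assert col_tp == full and row_tp == full
```

Because the LoRA term is linear, the row-parallel partial outputs can be added to the local residual before the existing all-reduce. Under the same assumption, a per-input-channel `pre_quant_scale` would need the same input-dimension slice as `L2`.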

jingyu-ml added 2 commits June 17, 2026 17:47

jingyu-ml requested review from a team as code owners June 23, 2026 00:06

jingyu-ml requested review from chang-l and kaiyux June 23, 2026 00:06

github-actions Bot assigned jingyu-ml Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

jingyu-ml wants to merge 2 commits into NVIDIA:main from jingyu-ml:feat/qwen-image-svdquant