[None][feat] Qwen-Image: NVFP4 SVDQuant (NVFP4 residual + rank-r BF16 LoRA)#15532

Open
jingyu-ml wants to merge 2 commits into NVIDIA:main from jingyu-ml:feat/qwen-image-svdquant
jingyu-ml commented Jun 23, 2026

What does this PR do?

Type of change: New feature

Adds NVFP4 SVDQuant support for Qwen-Image in VisualGen: running from a
ModelOpt SVDQuant checkpoint (quant_algo: NVFP4_SVD), where each quantized
linear is W ≈ R + L1·L2 — an NVFP4 residual R plus a per-input-channel
pre_quant_scale smoothing and a rank-r BF16 LoRA correction
(svdquant_lora_a = L2 [r,in], svdquant_lora_b = L1 [out,r]).
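
To make the decomposition concrete, here is a minimal pure-Python sketch of the algebra only, with NVFP4 rounding ignored. It assumes the factors live in the smoothed space, i.e. the smoothed weight is `W / pre_quant_scale` per input channel and equals `R + L1·L2`; the helper functions and the toy values are illustrative and not taken from the PR.

```python
# Toy check of the SVDQuant forward identity (no quantization).
# Assumption: W_hat = W / s per input channel, and W_hat = R + L1 @ L2.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


s = [2.0, 0.5, 4.0]                        # pre_quant_scale, shape [in]
L1 = [[1.0], [-2.0]]                       # svdquant_lora_b, shape [out, r]
L2 = [[0.5, 1.0, -1.0]]                    # svdquant_lora_a, shape [r, in]
R = [[0.25, 0.0, 1.0], [1.0, -0.5, 0.0]]   # residual, shape [out, in]

W_hat = add(R, matmul(L1, L2))                              # smoothed-space weight
W = [[w * sj for w, sj in zip(row, s)] for row in W_hat]    # undo the smoothing

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]                     # activations [batch, in]
X_hat = [[x * sj for x, sj in zip(row, s)] for row in X]    # X * pre_quant_scale

# Reference: plain linear layer on the original weight.
Y_ref = matmul(X, transpose(W))
# SVDQuant path: residual GEMM plus the rank-r LoRA correction.
Y_svd = add(
    matmul(X_hat, transpose(R)),
    matmul(matmul(X_hat, transpose(L2)), transpose(L1)),
)
```

In the real path `R` is stored in NVFP4, so the two results agree only up to quantization error; the LoRA term is what absorbs the part of the weight that quantizes poorly.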

Stacked on #15470 (NVFP4 static-checkpoint loading). The first commit is
that PR; review commit 54507bf for the SVDQuant-only diff, and merge #15470
first. The SVDQuant residual reuses the NVFP4 static-load path, the exclude
handling, and the relaxed key check from #15470.

Changes (the SVDQuant commit):

  • NVFP4SVDLinearMethod (transformer_qwen_image.py): `Y = nvfp4_gemm(quant(X̂), R) + (X̂ @ L2ᵀ) @ L1ᵀ + bias`, with `X̂ = X·pre_quant_scale`. Subclasses the NVFP4 method for the residual; loads `pre_quant_scale` + the two LoRA factors per Linear.
  • load_weights detects SVDQuant from svdquant_lora_a keys, swaps the method
    onto the quantized Linears, and relaxes the key check for the 3 extra tensors.
  • config.py: NVFP4_SVD → NVFP4 in algo_map so the residual loads on the
    standard static-NVFP4 path. Excluded layers (embedders/proj_out/first+last
    blocks) stay BF16, same as NVFP4.
  • New examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml; documented in
    visual-generation.md.
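
The detection step in the list above can be sketched roughly as follows. This is a hypothetical illustration on plain key strings: `ALGO_MAP`, `svdquant_module_names`, and the example key names are assumptions, and the real code works with `QuantAlgo` values and module objects rather than strings.

```python
# Hypothetical helpers; names and key layout are illustrative only.
# The NVFP4_SVD entry mirrors the PR: the residual loads as plain NVFP4.
ALGO_MAP = {"NVFP4": "NVFP4", "NVFP4_SVD": "NVFP4"}

LORA_A_SUFFIX = ".svdquant_lora_a"


def svdquant_module_names(checkpoint_keys):
    """Return the module prefixes that carry an SVDQuant LoRA factor."""
    return sorted(
        key[: -len(LORA_A_SUFFIX)]
        for key in checkpoint_keys
        if key.endswith(LORA_A_SUFFIX)
    )


keys = [
    "transformer_blocks.1.attn.to_q.weight",
    "transformer_blocks.1.attn.to_q.pre_quant_scale",
    "transformer_blocks.1.attn.to_q.svdquant_lora_a",
    "transformer_blocks.1.attn.to_q.svdquant_lora_b",
    "proj_out.weight",  # excluded layer, stays BF16, no LoRA keys
]

# Modules whose quant method would be swapped to the SVDQuant variant.
to_swap = svdquant_module_names(keys)
```

Keying the swap on the presence of `svdquant_lora_a` means a plain NVFP4 checkpoint yields an empty list and stays on the unchanged NVFP4 path.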

This is the functional path (BF16 matmuls for the LoRA). A fused FlashInfer
SVDQuant kernel for the residual+LoRA is a follow-up perf optimization.

Usage

python examples/visual_gen/models/qwen_image.py --model <qwen-image-svdquant> \
    --visual_gen_args examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml

Output samples (1328×1328, 50 steps, seed 42, same prompt)

SVDQuant quality is on par with NVFP4 and BF16:

[Image comparison, left to right: BF16 | NVFP4 | NVFP4 SVDQuant (this PR)]

Testing

On 1× GB200 (sm_100), TRT-LLM release container: the SVDQuant checkpoint loads
(729/729 transformer weights, no key/shape errors) and renders a coherent
1328² image visually on par with BF16/NVFP4 at the same prompt/seed. (The
residual-only output would be visibly degraded without the LoRA, so the clean
result confirms the LoRA term is applied.)

Before your PR is "Ready for review"

  • Backward compatible: ✅ (additive; NVFP4/BF16/dynamic paths unchanged)
  • New tests: ⚠️ covered by end-to-end image validation; unit coverage TODO

Summary by CodeRabbit

Release Notes

  • New Features

    • Added NVFP4 SVDQuant support for Qwen-Image with LoRA correction
    • Added Qwen-Image text-to-image generation CLI example
    • Added 1-GPU configuration files for BF16, NVFP4, and SVDQuant setups
  • Documentation

    • Enhanced Qwen-Image documentation with quantization and loading details
    • Added usage examples for Qwen-Image model configurations
  • Tests

    • Added quantization exclusion behavior test

…oints

Enable VisualGen to run Qwen-Image from a statically pre-quantized ModelOpt
checkpoint (NVFP4/FP8), and add the offline example + configs. Previously only
dynamic quantization (BF16 -> NVFP4 at load) worked; pointing --model at a
ModelOpt-exported NVFP4 checkpoint failed during weight loading.

transformer_qwen_image.py:
- Honor the checkpoint's quantization `ignore` list: clear quant_config on the
  excluded Linear modules before create_weights() so they build the unquantized
  method (ModelOpt stores those layers -- embedders, proj_out, norm_out,
  time_text_embed, first/last blocks -- in BF16). get_quant_method() selects the
  method purely from module.quant_config.
- Relax the strict weight-key check for FP8/NVFP4 helper buffers that are
  derived at load time and never serialized (alpha, inv_input_scale, kv_scales,
  inv_kv_scales).
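
As a rough illustration of the exclusion step, the sketch below clears `quant_config` on toy modules whose names match an ignore list. It assumes glob-style patterns matched with `fnmatch`; the actual pattern syntax of the checkpoint's `ignore` list, the `ToyLinear` class, and the function name are assumptions, not the PR's code.

```python
from fnmatch import fnmatch


class ToyLinear:
    """Stand-in for a Linear module; only quant_config matters here."""

    def __init__(self, quant_config):
        self.quant_config = quant_config


def clear_quant_config_on_excluded(named_linears, ignore_patterns):
    """Set quant_config=None on every Linear whose name matches a pattern."""
    cleared = []
    for name, linear in named_linears.items():
        if any(fnmatch(name, pattern) for pattern in ignore_patterns):
            linear.quant_config = None  # module will build the unquantized method
            cleared.append(name)
    return cleared


linears = {
    "img_in": ToyLinear("NVFP4"),
    "proj_out": ToyLinear("NVFP4"),
    "transformer_blocks.0.attn.to_q": ToyLinear("NVFP4"),
    "transformer_blocks.5.attn.to_q": ToyLinear("NVFP4"),
}
ignore = ["img_in", "proj_out", "transformer_blocks.0.*"]  # assumed glob syntax
cleared = clear_quant_config_on_excluded(linears, ignore)
```

The important ordering constraint from the commit message is that this runs before `create_weights()`, since the method is chosen from `module.quant_config` at that point.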

Both changes are backward compatible with the dynamic-quant and BF16 paths.

Add examples/visual_gen/models/qwen_image.py and qwen-image-{fp4,bf16}-1gpu.yaml;
document static-checkpoint support in README and visual-generation.md; add a
unit test for the exclusion logic.

Validated on GB200 (sm_100): a static NVFP4 checkpoint loads (729/729 weights)
and renders a 1328x1328 image on par with BF16; qwen registry tests pass (7/7).

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
… LoRA)

Builds on NVIDIA#15470 (NVFP4 static-checkpoint loading). Adds support for running
Qwen-Image from a ModelOpt NVFP4 SVDQuant checkpoint (quant_algo NVFP4_SVD):
W ~= R + L1.L2 with per-input-channel pre_quant_scale activation smoothing.

- NVFP4SVDLinearMethod (transformer_qwen_image.py): forward = NVFP4 residual
  GEMM on the smoothed activation Xhat = X * pre_quant_scale, plus a rank-r BF16
  LoRA term (Xhat @ svdquant_lora_a^T) @ svdquant_lora_b^T; loads pre_quant_scale
  and the two LoRA factors per quantized Linear.
- load_weights detects SVDQuant from the checkpoint's svdquant_lora_a keys,
  swaps the method onto the quantized Linears, and relaxes the strict key check
  for the three extra tensors. The residual loads on the standard static-NVFP4
  path (config.py maps NVFP4_SVD -> NVFP4); excluded layers stay BF16.
- Add examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml; document
  NVFP4_SVD support in visual-generation.md.
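
The relaxed key check described above might look something like this sketch. The derived suffixes come from the first commit message; treating `pre_quant_scale` and the two LoRA factors as the three extra tensors is an inference from the PR description, and `check_keys` itself is a hypothetical helper.

```python
# Buffers derived at load time and never serialized (from the commit message).
DERIVED_SUFFIXES = ("alpha", "inv_input_scale", "kv_scales", "inv_kv_scales")
# Extra per-Linear tensors in an SVDQuant checkpoint (assumed to be these three).
SVDQUANT_SUFFIXES = ("pre_quant_scale", "svdquant_lora_a", "svdquant_lora_b")


def check_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) after ignoring the tolerated suffixes."""

    def leaf(key):
        return key.rsplit(".", 1)[-1]

    model, ckpt = set(model_keys), set(checkpoint_keys)
    missing = sorted(k for k in model - ckpt if leaf(k) not in DERIVED_SUFFIXES)
    unexpected = sorted(k for k in ckpt - model if leaf(k) not in SVDQUANT_SUFFIXES)
    return missing, unexpected


model_keys = ["blk.to_q.weight", "blk.to_q.alpha", "blk.to_q.bias"]
checkpoint_keys = [
    "blk.to_q.weight",
    "blk.to_q.pre_quant_scale",
    "blk.to_q.svdquant_lora_a",
    "blk.to_q.svdquant_lora_b",
    "blk.to_q.stray",
]
missing, unexpected = check_keys(model_keys, checkpoint_keys)
```

Genuinely missing or stray tensors are still reported, so the check stays strict for everything outside the two tolerated suffix sets.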

Functional path (BF16 matmuls for the LoRA); a fused FlashInfer SVDQuant kernel
is a follow-up perf optimization.

Validated on GB200: the SVDQuant checkpoint loads (729/729 weights) and renders
a 1328x1328 image on par with BF16/NVFP4 at the same prompt and seed.

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
@jingyu-ml jingyu-ml requested review from a team as code owners June 23, 2026 00:06
@jingyu-ml jingyu-ml requested review from chang-l and kaiyux June 23, 2026 00:06
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

📝 Walkthrough

Adds NVFP4 SVDQuant support to the Qwen-Image VisualGen model: introduces NVFP4SVDLinearMethod (LoRA residual correction over NVFP4), maps NVFP4_SVD in quant config parsing, wires excluded-layer clearing and SVDQuant detection into load_weights, and adds a new CLI example script, three YAML configs, and documentation updates.

Changes

Qwen-Image NVFP4 SVDQuant Support

| Layer / File(s) | Summary |
| --- | --- |
| NVFP4SVDLinearMethod and quant config parsing. Files: tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py, tensorrt_llm/_torch/visual_gen/config.py | Adds NVFP4LinearMethod import and NVFP4SVDLinearMethod subclass with create_weights, load_weights_vanilla (loading pre_quant_scale, svdquant_lora_a/b), and apply (residual NVFP4 GEMM plus LoRA correction). Adds "NVFP4_SVD" → QuantAlgo.NVFP4 mapping in DiffusionPipelineConfig.load_diffusion_quant_config. |
| QwenImageTransformer2DModel integration. Files: tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py | Adds _clear_quant_config_on_excluded_layers() to nullify quant_config on excluded Linear modules before weight creation. Updates load_weights to invoke this helper, skip derived-suffix keys in missing/unexpected-key validation, and, after loading, detect svdquant_lora_a keys to swap eligible modules' quant_method to NVFP4SVDLinearMethod. |
| Unit test for excluded-layer clearing. Files: tests/unittest/_torch/visual_gen/test_qwen_image_registry.py | Adds test_static_quant_excludes_high_precision_layers: instantiates QwenImageTransformer2DModel with skip_create_weights_in_init=True, calls _clear_quant_config_on_excluded_layers(), and asserts excluded submodules have quant_config=None while non-excluded submodules retain NVFP4 quant_config. |
| Example CLI, configs, and docs. Files: examples/visual_gen/models/qwen_image.py, examples/visual_gen/configs/qwen-image-*.yaml, examples/visual_gen/README.md, docs/source/models/visual-generation.md | Adds qwen_image.py CLI script with _output_paths helper and main() using VisualGen. Adds three 1-GPU YAML configs (BF16, NVFP4, SVDQuant). Updates examples README with Qwen-Image BF16/NVFP4 usage. Expands the [^2] visual-generation.md footnote with BF16 parity and ModelOpt checkpoint details. |

Sequence Diagram(s)

sequenceDiagram
  participant Script as qwen_image.py
  participant VisualGen
  participant QwenImageTransformer2DModel
  participant NVFP4SVDLinearMethod

  Script->>VisualGen: VisualGen(model, visual_gen_args)
  Script->>VisualGen: generate(prompt, params)
  VisualGen->>QwenImageTransformer2DModel: load_weights(weights)
  QwenImageTransformer2DModel->>QwenImageTransformer2DModel: _clear_quant_config_on_excluded_layers()
  QwenImageTransformer2DModel->>QwenImageTransformer2DModel: create_weights() per Linear module
  QwenImageTransformer2DModel->>QwenImageTransformer2DModel: detect svdquant_lora_a keys → swap quant_method to NVFP4SVDLinearMethod
  VisualGen->>QwenImageTransformer2DModel: forward(x)
  QwenImageTransformer2DModel->>NVFP4SVDLinearMethod: apply(x, weight, bias)
  NVFP4SVDLinearMethod-->>QwenImageTransformer2DModel: residual_out + lora_correction
  QwenImageTransformer2DModel-->>VisualGen: image tensor
  VisualGen-->>Script: generated image
  Script->>Script: output.save(_output_paths(...))

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.85%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main feature: adding NVFP4 SVDQuant support for Qwen-Image with specific technical details about the residual and LoRA components. |
| Description check | ✅ Passed | The PR description comprehensively covers what the change does, the technical approach, usage example, testing validation, and backward compatibility status, though the PR Checklist at the end is not fully completed. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
examples/visual_gen/models/qwen_image.py (1)

42-42: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add an explicit return type on main.

main should be annotated as -> None to match the repository’s Python typing rule.

As per coding guidelines, “Always annotate functions. Make the return type None if the function does not return anything.”

Suggested patch
-def main():
+def main() -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/visual_gen/models/qwen_image.py` at line 42, The main function is
missing an explicit return type annotation. Add the return type annotation ->
None to the main function signature to indicate that the function does not
return a value. This aligns with the repository's Python typing guidelines which
require all functions to have explicit return type annotations.

Source: Coding guidelines

tests/unittest/_torch/visual_gen/test_qwen_image_registry.py (1)

96-109: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Coverage is insufficient for the load_weights() integration contract.

Line 96 tests the helper directly, but it does not verify that QwenImageTransformer2DModel.load_weights() still invokes _clear_quant_config_on_excluded_layers() before weight creation. A call-order/wiring regression could pass this test.
Coverage status: insufficient in tests/unittest/_torch/visual_gen/test_qwen_image_registry.py. Please add a follow-up integration test in this file or tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py that exercises load_weights() and asserts excluded vs non-excluded Linear.quant_config outcomes.

As per path instructions, "Act as a QA engineer reviewing test changes and coverage ... suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."
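
A wiring test of the kind requested here could be sketched with `unittest.mock`, as below. `ToyModel` and its `_create_weights` method are hypothetical stand-ins so the example runs without TensorRT-LLM; a real test would patch the corresponding methods on `QwenImageTransformer2DModel` and would still need the quant_config assertions described in the comment.

```python
from unittest import mock


class ToyModel:
    """Stand-in for the real model; only the call order matters here."""

    def _clear_quant_config_on_excluded_layers(self):
        pass

    def _create_weights(self):  # hypothetical name for the weight-creation step
        pass

    def load_weights(self, weights):
        self._clear_quant_config_on_excluded_layers()
        self._create_weights()


class MisorderedModel(ToyModel):
    """Simulates the regression: weights are created before the clearing step."""

    def load_weights(self, weights):
        self._create_weights()
        self._clear_quant_config_on_excluded_layers()


def clear_runs_before_create(model) -> bool:
    """Return True if load_weights() clears exclusions before creating weights."""
    manager = mock.Mock()
    with mock.patch.object(model, "_clear_quant_config_on_excluded_layers") as clear, \
            mock.patch.object(model, "_create_weights") as create:
        # Attaching both mocks to one manager records their relative order.
        manager.attach_mock(clear, "clear")
        manager.attach_mock(create, "create")
        model.load_weights({})
    return bool(manager.mock_calls == [mock.call.clear(), mock.call.create()])
```

The helper-only test passes for both toy classes, while this order check fails for `MisorderedModel`, which is the gap the comment points at.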

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/visual_gen/test_qwen_image_registry.py` around lines 96
- 109, The current test directly exercises the
_clear_quant_config_on_excluded_layers() helper method, but does not verify that
QwenImageTransformer2DModel.load_weights() integrates and invokes this method
before weight creation, leaving a potential regression gap. Add a follow-up
integration test that calls the full load_weights() method on a
QwenImageTransformer2DModel instance and asserts that excluded layer modules
(img_in, txt_in, proj_out, norm_out.linear, transformer_blocks attn.to_q,
img_mlp projections) have their quant_config cleared to None, while non-excluded
quantized blocks retain their QuantAlgo.NVFP4 configuration. This test can be
added in the current test file or in
tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py to verify the
end-to-end integration contract.

Source: Path instructions

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed7ce4 and 54507bf.

📒 Files selected for processing (9)
  • docs/source/models/visual-generation.md
  • examples/visual_gen/README.md
  • examples/visual_gen/configs/qwen-image-bf16-1gpu.yaml
  • examples/visual_gen/configs/qwen-image-fp4-1gpu.yaml
  • examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml
  • examples/visual_gen/models/qwen_image.py
  • tensorrt_llm/_torch/visual_gen/config.py
  • tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py
  • tests/unittest/_torch/visual_gen/test_qwen_image_registry.py

Comment on lines +1 to +98
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Qwen-Image text-to-image generation.

Usage:
    # BF16 reference (HF Hub id or local diffusers checkpoint)
    python qwen_image.py --model Qwen/Qwen-Image

    # NVFP4 (ModelOpt pre-quantized checkpoint; quantization is read from the
    # checkpoint's transformer/config.json)
    python qwen_image.py --model <qwen-image-nvfp4> \
        --visual_gen_args ../configs/qwen-image-fp4-1gpu.yaml
"""

import argparse
from pathlib import Path

from tensorrt_llm import VisualGen, VisualGenArgs


def _output_paths(output_path: str, num_images: int) -> str | list[str]:
    if num_images == 1:
        return output_path

    path = Path(output_path)
    return [str(path.with_name(f"{path.stem}_{idx + 1}{path.suffix}")) for idx in range(num_images)]


def main():
    parser = argparse.ArgumentParser(description="Qwen-Image Text-to-Image example")
    parser.add_argument(
        "--model",
        type=str,
        default="Qwen/Qwen-Image",
        help="Model path or HuggingFace Hub ID (BF16 base or a ModelOpt-quantized checkpoint)",
    )
    parser.add_argument(
        "--visual_gen_args",
        dest="visual_gen_args",
        type=str,
        default=None,
        help="Path to YAML config (same as trtllm-serve --visual_gen_args)",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default=(
            "A coffee shop entrance features a chalkboard sign reading "
            '"Qwen Coffee, $2 per cup," with a neon light beside it displaying '
            "a steaming coffee cup, photorealistic, highly detailed"),
        help="Text prompt for image generation",
    )
    parser.add_argument(
        "--num_images_per_prompt",
        type=int,
        default=1,
        help="Number of images to generate for the prompt",
    )
    parser.add_argument(
        "--output_path",
        type=str,
        default="qwen_image_output.png",
        help="Path to save the output image. For multiple images, an index is appended.",
    )
    args = parser.parse_args()
    if args.num_images_per_prompt < 1:
        raise ValueError("--num_images_per_prompt must be >= 1")

    # Engine config from shared YAML (optional); model-specific defaults apply otherwise.
    extra_args = VisualGenArgs.from_yaml(args.visual_gen_args) if args.visual_gen_args else None
    visual_gen = VisualGen(model=args.model, args=extra_args)

    # --- Model-specific: T2I request construction ---
    # Start from per-model defaults (resolution, steps, guidance, seed, etc.) and set image count.
    params = visual_gen.default_params
    params.num_images_per_prompt = args.num_images_per_prompt

    output = visual_gen.generate(inputs=args.prompt, params=params)

    saved = output.save(_output_paths(args.output_path, args.num_images_per_prompt))
    print(f"Saved: {saved}")


if __name__ == "__main__":
    main()

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Commit the formatter output to unblock CI.

Pre-commit is failing on ruff-format; this PR needs the formatter changes committed before merge.

🧰 Tools
🪛 GitHub Actions: Release Checks / 0_Pre-commit Check.txt

[error] 1-1: pre-commit hook 'ruff-format' failed (files reformatted). 2 files were reformatted by this hook; commit should include the formatting changes.

🪛 GitHub Actions: Release Checks / Pre-commit Check

[error] 1-1: pre-commit hook failed: ruff-format. The file was reformatted by ruff-format (2 files reformatted total). Commit the formatting changes or run ruff format.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/visual_gen/models/qwen_image.py` around lines 1 - 98, The
qwen_image.py file has formatting issues detected by ruff-format that are
blocking CI. Run the ruff formatter on this file to automatically fix formatting
violations, review the changes to ensure they are correct, and commit the
formatter output to your branch before merge.

Source: Pipeline failures

Comment on lines +33 to +35
# Qwen-Image NVFP4: point --model at a ModelOpt-quantized checkpoint; the NVFP4
# config is read from the checkpoint. (Use a BF16 checkpoint for the baseline.)
python models/qwen_image.py --model <qwen-image-nvfp4> --visual_gen_args configs/qwen-image-fp4-1gpu.yaml

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add a matching SVDQuant usage command in the README.

This section documents BF16 and NVFP4, but the new SVDQuant path added in this PR is not shown. Please add a concrete command for configs/qwen-image-svdquant-1gpu.yaml so users can discover and run it directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/visual_gen/README.md` around lines 33 - 35, The README currently
documents how to use the NVFP4 quantized model with a specific command, but does
not include documentation for the new SVDQuant quantization option that was
added in this PR. Add a new section or command block after the NVFP4
documentation that shows users how to run the SVDQuant variant using the
configs/qwen-image-svdquant-1gpu.yaml configuration file, following the same
format and structure as the existing NVFP4 command to ensure consistency and
discoverability.

Comment on lines +65 to +117
class NVFP4SVDLinearMethod(NVFP4LinearMethod):
    """SVDQuant: NVFP4 residual GEMM + rank-r BF16 LoRA correction.

    ModelOpt SVDQuant factorizes ``W ≈ R + L1·L2`` with per-input-channel
    activation smoothing ``s`` (``pre_quant_scale``). With ``X̂ = X · s``, the
    smoothed-space residual ``R`` (NVFP4) and low-rank term give::

        Y = nvfp4_gemm(quant(X̂), R) · scales + (X̂ @ L2ᵀ) @ L1ᵀ  [+ bias]

    where ``svdquant_lora_a`` = L2 ``[r, in]`` and ``svdquant_lora_b`` = L1
    ``[out, r]``. The NVFP4 residual reuses the base method; this subclass adds
    the smoothing + LoRA correction. Functional path (BF16 matmuls for the
    LoRA); the fused FlashInfer SVDQuant kernel is a separate perf optimization.
    """

    def create_weights(self, module, in_features, out_features, bias, dtype):
        super().create_weights(module, in_features, out_features, bias, dtype)
        # Materialized lazily in load_weights_vanilla (rank comes from the ckpt).
        module.svdquant_lora_a = None
        module.svdquant_lora_b = None

    def load_weights_vanilla(self, module, weights, allow_partial_loading: bool = False) -> None:
        super().load_weights_vanilla(module, weights, allow_partial_loading)
        w = weights[0]
        device = module.weight.device
        # pre_quant_scale ([in_features]) may already be loaded by the base NVFP4
        # method on newer releases; load it here too for robustness.
        if getattr(module, "pre_quant_scale", None) is None and "pre_quant_scale" in w:
            module.pre_quant_scale = nn.Parameter(
                w["pre_quant_scale"].to(device), requires_grad=False)
        if "svdquant_lora_a" in w:
            module.svdquant_lora_a = nn.Parameter(
                w["svdquant_lora_a"].to(device), requires_grad=False)
            module.svdquant_lora_b = nn.Parameter(
                w["svdquant_lora_b"].to(device), requires_grad=False)

    def apply(self, module, input, bias):
        pqs = getattr(module, "pre_quant_scale", None)
        x_hat = input * pqs if pqs is not None else input
        # Residual NVFP4 GEMM on the already-smoothed activation; clear
        # pre_quant_scale so the base method does not smooth a second time.
        saved = getattr(module, "pre_quant_scale", None)
        module.pre_quant_scale = None
        try:
            out = super().apply(module, x_hat, bias)
        finally:
            module.pre_quant_scale = saved
        a = getattr(module, "svdquant_lora_a", None)
        b = getattr(module, "svdquant_lora_b", None)
        if a is not None and b is not None:
            lora = torch.matmul(torch.matmul(x_hat, a.t()), b.t())
            out = out + lora.to(out.dtype)
        return out

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Disable NCCL-window output for SVDQuant linears.

NVFP4SVDLinearMethod inherits supports_nccl_symmetric_memory_window_output=True, but Line 116 replaces the GEMM output with out + lora, so the returned tensor is no longer the NCCL-window buffer that Linear.forward() expects on the symmetric-memory all-reduce path. Override the class flag to False for this method, or fuse the LoRA add into the window buffer before returning.
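
One detail worth keeping in mind when adding such an override: in Python, a class docstring must be the first statement in the class body, so the flag should come after it. The sketch below uses hypothetical stand-in classes (not the real `NVFP4LinearMethod`) to show the difference.

```python
class BaseMethod:
    """Stand-in for the parent linear method (hypothetical, illustration only)."""

    supports_nccl_symmetric_memory_window_output = True


class SVDMethod(BaseMethod):
    """Stand-in for the SVDQuant subclass."""

    # The LoRA add returns a new tensor, so opt out of the window-buffer path.
    supports_nccl_symmetric_memory_window_output = False


class AttributeFirst(BaseMethod):
    supports_nccl_symmetric_memory_window_output = False
    """This string is a plain expression statement, not a docstring."""


flags = (
    BaseMethod.supports_nccl_symmetric_memory_window_output,
    SVDMethod.supports_nccl_symmetric_memory_window_output,
)
docs = (SVDMethod.__doc__, AttributeFirst.__doc__)
```

Both subclasses override the flag correctly, but only `SVDMethod` keeps its docstring, which matters given the docstring-coverage check on this PR.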

Proposed fix
 class NVFP4SVDLinearMethod(NVFP4LinearMethod):
+    supports_nccl_symmetric_memory_window_output = False
+
     """SVDQuant: NVFP4 residual GEMM + rank-r BF16 LoRA correction.
🧰 Tools
🪛 Ruff (0.15.18)

[error] 101-101: Function argument input is shadowing a Python builtin

(A002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py`
around lines 65 - 117, The NVFP4SVDLinearMethod class inherits
supports_nccl_symmetric_memory_window_output=True from its parent, but the apply
method modifies the output tensor by adding LoRA correction (line 116: out +
lora), which breaks the assumption that the returned tensor is the original
NCCL-window buffer expected by the symmetric-memory all-reduce path. Override
the supports_nccl_symmetric_memory_window_output class attribute to False in
NVFP4SVDLinearMethod to disable NCCL-window output support for this method,
since the output tensor is no longer the expected window buffer after the LoRA
addition.

Comment on lines +95 to +99
        if "svdquant_lora_a" in w:
            module.svdquant_lora_a = nn.Parameter(
                w["svdquant_lora_a"].to(device), requires_grad=False)
            module.svdquant_lora_b = nn.Parameter(
                w["svdquant_lora_b"].to(device), requires_grad=False)

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Shard SVDQuant LoRA factors for tensor-parallel linears.

The LoRA tensors are loaded directly from the checkpoint, but Linear stores local in_features/out_features under row/column TP. For row TP, svdquant_lora_a must be sharded on the input dimension; for column TP, svdquant_lora_b must be sharded on the output dimension. Otherwise the LoRA matmul can either shape-mismatch or add a global-output correction to a local residual. Please mirror the base NVFP4 sharding pattern when loading these factors.

Also applies to: 115-116
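
The sharding requirement can be illustrated with a small pure-Python sketch that simulates two ranks. It uses no TensorRT-LLM API: the all-reduce is modeled as an elementwise sum, the column-parallel gather as a concatenation, and all helper names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def shard_cols(m, rank, tp):
    """Slice the last dimension (the input dimension for row TP)."""
    width = len(m[0]) // tp
    return [row[rank * width:(rank + 1) * width] for row in m]


def shard_rows(m, rank, tp):
    """Slice the first dimension (the output dimension for column TP)."""
    height = len(m) // tp
    return m[rank * height:(rank + 1) * height]


x_hat = [[1.0, 2.0, 3.0, 4.0]]     # smoothed activation, [batch=1, in=4]
lora_a = [[1.0, 0.0, -1.0, 2.0]]   # L2, [r=1, in=4]
lora_b = [[2.0], [-1.0]]           # L1, [out=2, r=1]
tp = 2

# Unsharded reference: (x_hat @ a^T) @ b^T
full = matmul(matmul(x_hat, transpose(lora_a)), transpose(lora_b))

# Row TP: each rank sees a slice of the input dim, so lora_a is sliced the
# same way; the per-rank partial outputs are summed (simulated all-reduce).
row_partials = [
    matmul(
        matmul(shard_cols(x_hat, r, tp), transpose(shard_cols(lora_a, r, tp))),
        transpose(lora_b),
    )
    for r in range(tp)
]
row_tp = [[sum(vals) for vals in zip(*rows)] for rows in zip(*row_partials)]

# Column TP: each rank owns a slice of the output dim, so lora_b is sliced on
# its rows; the per-rank outputs are concatenated (simulated gather).
col_parts = [
    matmul(matmul(x_hat, transpose(lora_a)), transpose(shard_rows(lora_b, r, tp)))
    for r in range(tp)
]
col_tp = [sum((part[i] for part in col_parts), []) for i in range(len(x_hat))]
```

If `lora_a` were left unsharded under row TP, the inner matmul would shape-mismatch against the local activation; if `lora_b` were left unsharded under column TP, each rank would add a full-width correction to a local-width residual.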

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py`
around lines 95 - 99, The svdquant_lora_a and svdquant_lora_b parameters are
being loaded directly from the checkpoint without accounting for tensor
parallelism sharding requirements. For tensor-parallel linears, svdquant_lora_a
must be sharded along the input dimension during row TP, and svdquant_lora_b
must be sharded along the output dimension during column TP. Apply the same
sharding pattern used for the base NVFP4 factors to both svdquant_lora_a and
svdquant_lora_b before assigning them as module parameters in the conditional
block where "svdquant_lora_a" is in w, and also in the similar block mentioned
at lines 115-116. This ensures the LoRA matrices align with the module's local
in_features and out_features under tensor parallelism.
