[None][feat] Qwen-Image: NVFP4 SVDQuant (NVFP4 residual + rank-r BF16 LoRA) #15532
jingyu-ml wants to merge 2 commits into
…oints
Enable VisualGen to run Qwen-Image from a statically pre-quantized ModelOpt
checkpoint (NVFP4/FP8), and add the offline example + configs. Previously only
dynamic quantization (BF16 -> NVFP4 at load) worked; pointing --model at a
ModelOpt-exported NVFP4 checkpoint failed during weight loading.
transformer_qwen_image.py:
- Honor the checkpoint's quantization `ignore` list: clear quant_config on the
excluded Linear modules before create_weights() so they build the unquantized
method (ModelOpt stores those layers -- embedders, proj_out, norm_out,
time_text_embed, first/last blocks -- in BF16). get_quant_method() selects the
method purely from module.quant_config.
- Relax the strict weight-key check for FP8/NVFP4 helper buffers that are
derived at load time and never serialized (alpha, inv_input_scale, kv_scales,
inv_kv_scales).
Both changes are backward compatible with the dynamic-quant and BF16 paths.
Add examples/visual_gen/models/qwen_image.py and qwen-image-{fp4,bf16}-1gpu.yaml;
document static-checkpoint support in README and visual-generation.md; add a
unit test for the exclusion logic.
Validated on GB200 (sm_100): a static NVFP4 checkpoint loads (729/729 weights)
and renders a 1328x1328 image on par with BF16; qwen registry tests pass (7/7).
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
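The exclusion step described in the first bullet can be sketched as below. This is a minimal, dependency-free illustration and not the code in transformer_qwen_image.py: `FakeLinear`, `clear_quant_config_on_excluded`, and the module names are hypothetical stand-ins, and treating the `ignore` entries as glob-style patterns is an assumption.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Dict, List, Optional


@dataclass
class FakeLinear:
    """Hypothetical stand-in for a Linear module; it only carries quant_config."""

    quant_config: Optional[str] = "NVFP4"


def clear_quant_config_on_excluded(
    modules: Dict[str, FakeLinear], ignore: List[str]
) -> List[str]:
    """Set quant_config to None on every module whose name matches an ignore pattern.

    This has to run before weights are created, because the quant method is
    selected from module.quant_config alone.
    """
    cleared = []
    for name, module in modules.items():
        if any(fnmatch(name, pattern) for pattern in ignore):
            module.quant_config = None
            cleared.append(name)
    return cleared


modules = {
    "img_in": FakeLinear(),
    "proj_out": FakeLinear(),
    "transformer_blocks.0.attn.to_q": FakeLinear(),
    "transformer_blocks.30.attn.to_q": FakeLinear(),
}
# Assumed pattern syntax; the real ignore list comes from the checkpoint config.
ignore = ["img_in", "proj_out", "transformer_blocks.0.*"]
cleared = clear_quant_config_on_excluded(modules, ignore)
print(cleared)
```

Modules that are not matched keep their quantized config, so the dynamic-quant and BF16 paths are unaffected when the ignore list is empty.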
… LoRA)
Builds on NVIDIA#15470 (NVFP4 static-checkpoint loading). Adds support for
running Qwen-Image from a ModelOpt NVFP4 SVDQuant checkpoint (quant_algo
NVFP4_SVD): W ~= R + L1.L2 with per-input-channel pre_quant_scale activation
smoothing.
- NVFP4SVDLinearMethod (transformer_qwen_image.py): forward = NVFP4 residual
GEMM on the smoothed activation Xhat = X * pre_quant_scale, plus a rank-r BF16
LoRA term (Xhat @ svdquant_lora_a^T) @ svdquant_lora_b^T; loads pre_quant_scale
and the two LoRA factors per quantized Linear.
- load_weights detects SVDQuant from the checkpoint's svdquant_lora_a keys,
swaps the method onto the quantized Linears, and relaxes the strict key check
for the three extra tensors. The residual loads on the standard static-NVFP4
path (config.py maps NVFP4_SVD -> NVFP4); excluded layers stay BF16.
- Add examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml; document
NVFP4_SVD support in visual-generation.md.
Functional path (BF16 matmuls for the LoRA); a fused FlashInfer SVDQuant kernel
is a follow-up perf optimization.
Validated on GB200: the SVDQuant checkpoint loads (729/729 weights) and renders
a 1328x1328 image on par with BF16/NVFP4 at the same prompt and seed.
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
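The detection and relaxed key check from the second bullet might look roughly like the following. It is a sketch under stated assumptions: the function names, the example key names, and the direction of the strict check (checkpoint keys without a matching model parameter) are all hypothetical; only the three tensor names are taken from the commit message.

```python
from typing import Set

# Suffixes of the tensors that only appear in SVDQuant checkpoints.
SVDQUANT_SUFFIXES = (".pre_quant_scale", ".svdquant_lora_a", ".svdquant_lora_b")


def is_svdquant_checkpoint(checkpoint_keys: Set[str]) -> bool:
    """Detect SVDQuant by the presence of at least one svdquant_lora_a tensor."""
    return any(key.endswith(".svdquant_lora_a") for key in checkpoint_keys)


def unexpected_keys(
    checkpoint_keys: Set[str], model_keys: Set[str], svdquant: bool
) -> Set[str]:
    """Checkpoint keys with no model parameter, ignoring the SVDQuant extras."""
    extra = checkpoint_keys - model_keys
    if svdquant:
        # str.endswith accepts a tuple of suffixes.
        extra = {key for key in extra if not key.endswith(SVDQUANT_SUFFIXES)}
    return extra


model_keys = {"blocks.1.to_q.weight", "blocks.1.to_q.weight_scale"}
checkpoint_keys = model_keys | {
    "blocks.1.to_q.pre_quant_scale",
    "blocks.1.to_q.svdquant_lora_a",
    "blocks.1.to_q.svdquant_lora_b",
    "blocks.1.to_q.typo_tensor",  # a genuinely unexpected key should still surface
}
svdquant = is_svdquant_checkpoint(checkpoint_keys)
leftover = unexpected_keys(checkpoint_keys, model_keys, svdquant)
print(svdquant, leftover)
```

Filtering by suffix rather than disabling the check keeps real mismatches visible while tolerating the three extra tensors.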
📝 Walkthrough
Adds NVFP4 SVDQuant support to the Qwen-Image VisualGen model: introduces
Changes
Qwen-Image NVFP4 SVDQuant Support
Sequence Diagram(s)
sequenceDiagram
participant Script as qwen_image.py
participant VisualGen
participant QwenImageTransformer2DModel
participant NVFP4SVDLinearMethod
Script->>VisualGen: VisualGen(model, visual_gen_args)
Script->>VisualGen: generate(prompt, params)
VisualGen->>QwenImageTransformer2DModel: load_weights(weights)
QwenImageTransformer2DModel->>QwenImageTransformer2DModel: _clear_quant_config_on_excluded_layers()
QwenImageTransformer2DModel->>QwenImageTransformer2DModel: create_weights() per Linear module
QwenImageTransformer2DModel->>QwenImageTransformer2DModel: detect svdquant_lora_a keys → swap quant_method to NVFP4SVDLinearMethod
VisualGen->>QwenImageTransformer2DModel: forward(x)
QwenImageTransformer2DModel->>NVFP4SVDLinearMethod: apply(x, weight, bias)
NVFP4SVDLinearMethod-->>QwenImageTransformer2DModel: residual_out + lora_correction
QwenImageTransformer2DModel-->>VisualGen: image tensor
VisualGen-->>Script: generated image
Script->>Script: output.save(_output_paths(...))
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🧹 Nitpick comments (2)
examples/visual_gen/models/qwen_image.py (1)
42-42: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Add an explicit return type on `main`.
`main` should be annotated as `-> None` to match the repository's Python typing rule.
As per coding guidelines, "Always annotate functions. Make the return type `None` if the function does not return anything."
Suggested patch
-def main():
+def main() -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/visual_gen/models/qwen_image.py` at line 42, The main function is missing an explicit return type annotation. Add the return type annotation -> None to the main function signature to indicate that the function does not return a value. This aligns with the repository's Python typing guidelines which require all functions to have explicit return type annotations.
Source: Coding guidelines
tests/unittest/_torch/visual_gen/test_qwen_image_registry.py (1)
96-109: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Coverage is insufficient for the `load_weights()` integration contract.
Line 96 tests the helper directly, but it does not verify that `QwenImageTransformer2DModel.load_weights()` still invokes `_clear_quant_config_on_excluded_layers()` before weight creation. A call-order/wiring regression could pass this test.
Coverage status: insufficient in `tests/unittest/_torch/visual_gen/test_qwen_image_registry.py`. Please add a follow-up integration test in this file or `tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py` that exercises `load_weights()` and asserts excluded vs non-excluded `Linear.quant_config` outcomes.
As per path instructions, "Act as a QA engineer reviewing test changes and coverage ... suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/visual_gen/test_qwen_image_registry.py` around lines 96 - 109, The current test directly exercises the _clear_quant_config_on_excluded_layers() helper method, but does not verify that QwenImageTransformer2DModel.load_weights() integrates and invokes this method before weight creation, leaving a potential regression gap. Add a follow-up integration test that calls the full load_weights() method on a QwenImageTransformer2DModel instance and asserts that excluded layer modules (img_in, txt_in, proj_out, norm_out.linear, transformer_blocks attn.to_q, img_mlp projections) have their quant_config cleared to None, while non-excluded quantized blocks retain their QuantAlgo.NVFP4 configuration. This test can be added in the current test file or in tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py to verify the end-to-end integration contract.
Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/visual_gen/models/qwen_image.py`:
- Around line 1-98: The qwen_image.py file has formatting issues detected by
ruff-format that are blocking CI. Run the ruff formatter on this file to
automatically fix formatting violations, review the changes to ensure they are
correct, and commit the formatter output to your branch before merge.
In `@examples/visual_gen/README.md`:
- Around line 33-35: The README currently documents how to use the NVFP4
quantized model with a specific command, but does not include documentation for
the new SVDQuant quantization option that was added in this PR. Add a new
section or command block after the NVFP4 documentation that shows users how to
run the SVDQuant variant using the configs/qwen-image-svdquant-1gpu.yaml
configuration file, following the same format and structure as the existing
NVFP4 command to ensure consistency and discoverability.
In `@tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py`:
- Around line 95-99: The svdquant_lora_a and svdquant_lora_b parameters are
being loaded directly from the checkpoint without accounting for tensor
parallelism sharding requirements. For tensor-parallel linears, svdquant_lora_a
must be sharded along the input dimension during row TP, and svdquant_lora_b
must be sharded along the output dimension during column TP. Apply the same
sharding pattern used for the base NVFP4 factors to both svdquant_lora_a and
svdquant_lora_b before assigning them as module parameters in the conditional
block where "svdquant_lora_a" is in w, and also in the similar block mentioned
at lines 115-116. This ensures the LoRA matrices align with the module's local
in_features and out_features under tensor parallelism.
- Around line 65-117: The NVFP4SVDLinearMethod class inherits
supports_nccl_symmetric_memory_window_output=True from its parent, but the apply
method modifies the output tensor by adding LoRA correction (line 116: out +
lora), which breaks the assumption that the returned tensor is the original
NCCL-window buffer expected by the symmetric-memory all-reduce path. Override
the supports_nccl_symmetric_memory_window_output class attribute to False in
NVFP4SVDLinearMethod to disable NCCL-window output support for this method,
since the output tensor is no longer the expected window buffer after the LoRA
addition.
---
Nitpick comments:
In `@examples/visual_gen/models/qwen_image.py`:
- Line 42: The main function is missing an explicit return type annotation. Add
the return type annotation -> None to the main function signature to indicate
that the function does not return a value. This aligns with the repository's
Python typing guidelines which require all functions to have explicit return
type annotations.
In `@tests/unittest/_torch/visual_gen/test_qwen_image_registry.py`:
- Around line 96-109: The current test directly exercises the
_clear_quant_config_on_excluded_layers() helper method, but does not verify that
QwenImageTransformer2DModel.load_weights() integrates and invokes this method
before weight creation, leaving a potential regression gap. Add a follow-up
integration test that calls the full load_weights() method on a
QwenImageTransformer2DModel instance and asserts that excluded layer modules
(img_in, txt_in, proj_out, norm_out.linear, transformer_blocks attn.to_q,
img_mlp projections) have their quant_config cleared to None, while non-excluded
quantized blocks retain their QuantAlgo.NVFP4 configuration. This test can be
added in the current test file or in
tests/unittest/_torch/visual_gen/test_qwen_image_load_weights.py to verify the
end-to-end integration contract.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 446d64dd-5235-4ae7-b5d4-263ac3e96b3a
📒 Files selected for processing (9)
docs/source/models/visual-generation.md
examples/visual_gen/README.md
examples/visual_gen/configs/qwen-image-bf16-1gpu.yaml
examples/visual_gen/configs/qwen-image-fp4-1gpu.yaml
examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml
examples/visual_gen/models/qwen_image.py
tensorrt_llm/_torch/visual_gen/config.py
tensorrt_llm/_torch/visual_gen/models/qwen_image/transformer_qwen_image.py
tests/unittest/_torch/visual_gen/test_qwen_image_registry.py
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Qwen-Image text-to-image generation.

Usage:
    # BF16 reference (HF Hub id or local diffusers checkpoint)
    python qwen_image.py --model Qwen/Qwen-Image

    # NVFP4 (ModelOpt pre-quantized checkpoint; quantization is read from the
    # checkpoint's transformer/config.json)
    python qwen_image.py --model <qwen-image-nvfp4> \
        --visual_gen_args ../configs/qwen-image-fp4-1gpu.yaml
"""

import argparse
from pathlib import Path

from tensorrt_llm import VisualGen, VisualGenArgs


def _output_paths(output_path: str, num_images: int) -> str | list[str]:
    if num_images == 1:
        return output_path

    path = Path(output_path)
    return [str(path.with_name(f"{path.stem}_{idx + 1}{path.suffix}")) for idx in range(num_images)]


def main():
    parser = argparse.ArgumentParser(description="Qwen-Image Text-to-Image example")
    parser.add_argument(
        "--model",
        type=str,
        default="Qwen/Qwen-Image",
        help="Model path or HuggingFace Hub ID (BF16 base or a ModelOpt-quantized checkpoint)",
    )
    parser.add_argument(
        "--visual_gen_args",
        dest="visual_gen_args",
        type=str,
        default=None,
        help="Path to YAML config (same as trtllm-serve --visual_gen_args)",
    )
    parser.add_argument(
        "--prompt",
        type=str,
        default=(
            "A coffee shop entrance features a chalkboard sign reading "
            '"Qwen Coffee, $2 per cup," with a neon light beside it displaying '
            "a steaming coffee cup, photorealistic, highly detailed"),
        help="Text prompt for image generation",
    )
    parser.add_argument(
        "--num_images_per_prompt",
        type=int,
        default=1,
        help="Number of images to generate for the prompt",
    )
    parser.add_argument(
        "--output_path",
        type=str,
        default="qwen_image_output.png",
        help="Path to save the output image. For multiple images, an index is appended.",
    )
    args = parser.parse_args()
    if args.num_images_per_prompt < 1:
        raise ValueError("--num_images_per_prompt must be >= 1")

    # Engine config from shared YAML (optional); model-specific defaults apply otherwise.
    extra_args = VisualGenArgs.from_yaml(args.visual_gen_args) if args.visual_gen_args else None
    visual_gen = VisualGen(model=args.model, args=extra_args)

    # --- Model-specific: T2I request construction ---
    # Start from per-model defaults (resolution, steps, guidance, seed, etc.) and set image count.
    params = visual_gen.default_params
    params.num_images_per_prompt = args.num_images_per_prompt

    output = visual_gen.generate(inputs=args.prompt, params=params)

    saved = output.save(_output_paths(args.output_path, args.num_images_per_prompt))
    print(f"Saved: {saved}")


if __name__ == "__main__":
    main()
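For reference, the naming rule in `_output_paths` ("for multiple images, an index is appended") can be checked in isolation. The snippet below copies the helper as shown above so that it runs without tensorrt_llm; the `from __future__` import is an addition that only keeps the annotation valid on older Python versions.

```python
from __future__ import annotations

from pathlib import Path


def _output_paths(output_path: str, num_images: int) -> str | list[str]:
    # A single image keeps the requested path unchanged.
    if num_images == 1:
        return output_path

    # Several images get a 1-based index inserted before the file suffix.
    path = Path(output_path)
    return [str(path.with_name(f"{path.stem}_{idx + 1}{path.suffix}")) for idx in range(num_images)]


single = _output_paths("qwen_image_output.png", 1)
several = _output_paths("qwen_image_output.png", 3)
print(single)
print(several)
```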
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Commit the formatter output to unblock CI.
Pre-commit is failing on ruff-format; this PR needs the formatter changes committed before merge.
🧰 Tools
🪛 GitHub Actions: Release Checks / 0_Pre-commit Check.txt
[error] 1-1: pre-commit hook 'ruff-format' failed (files reformatted). 2 files were reformatted by this hook; commit should include the formatting changes.
🪛 GitHub Actions: Release Checks / Pre-commit Check
[error] 1-1: pre-commit hook failed: ruff-format. The file was reformatted by ruff-format (2 files reformatted total). Commit the formatting changes or run ruff format.
Source: Pipeline failures
# Qwen-Image NVFP4: point --model at a ModelOpt-quantized checkpoint; the NVFP4
# config is read from the checkpoint. (Use a BF16 checkpoint for the baseline.)
python models/qwen_image.py --model <qwen-image-nvfp4> --visual_gen_args configs/qwen-image-fp4-1gpu.yaml
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Add a matching SVDQuant usage command in the README.
This section documents BF16 and NVFP4, but the new SVDQuant path added in this PR is not shown. Please add a concrete command for configs/qwen-image-svdquant-1gpu.yaml so users can discover and run it directly.
class NVFP4SVDLinearMethod(NVFP4LinearMethod):
    """SVDQuant: NVFP4 residual GEMM + rank-r BF16 LoRA correction.

    ModelOpt SVDQuant factorizes ``W ≈ R + L1·L2`` with per-input-channel
    activation smoothing ``s`` (``pre_quant_scale``). With ``X̂ = X · s``, the
    smoothed-space residual ``R`` (NVFP4) and low-rank term give::

        Y = nvfp4_gemm(quant(X̂), R) · scales + (X̂ @ L2ᵀ) @ L1ᵀ [+ bias]

    where ``svdquant_lora_a`` = L2 ``[r, in]`` and ``svdquant_lora_b`` = L1
    ``[out, r]``. The NVFP4 residual reuses the base method; this subclass adds
    the smoothing + LoRA correction. Functional path (BF16 matmuls for the
    LoRA); the fused FlashInfer SVDQuant kernel is a separate perf optimization.
    """

    def create_weights(self, module, in_features, out_features, bias, dtype):
        super().create_weights(module, in_features, out_features, bias, dtype)
        # Materialized lazily in load_weights_vanilla (rank comes from the ckpt).
        module.svdquant_lora_a = None
        module.svdquant_lora_b = None

    def load_weights_vanilla(self, module, weights, allow_partial_loading: bool = False) -> None:
        super().load_weights_vanilla(module, weights, allow_partial_loading)
        w = weights[0]
        device = module.weight.device
        # pre_quant_scale ([in_features]) may already be loaded by the base NVFP4
        # method on newer releases; load it here too for robustness.
        if getattr(module, "pre_quant_scale", None) is None and "pre_quant_scale" in w:
            module.pre_quant_scale = nn.Parameter(
                w["pre_quant_scale"].to(device), requires_grad=False)
        if "svdquant_lora_a" in w:
            module.svdquant_lora_a = nn.Parameter(
                w["svdquant_lora_a"].to(device), requires_grad=False)
            module.svdquant_lora_b = nn.Parameter(
                w["svdquant_lora_b"].to(device), requires_grad=False)

    def apply(self, module, input, bias):
        pqs = getattr(module, "pre_quant_scale", None)
        x_hat = input * pqs if pqs is not None else input
        # Residual NVFP4 GEMM on the already-smoothed activation; clear
        # pre_quant_scale so the base method does not smooth a second time.
        saved = getattr(module, "pre_quant_scale", None)
        module.pre_quant_scale = None
        try:
            out = super().apply(module, x_hat, bias)
        finally:
            module.pre_quant_scale = saved
        a = getattr(module, "svdquant_lora_a", None)
        b = getattr(module, "svdquant_lora_b", None)
        if a is not None and b is not None:
            lora = torch.matmul(torch.matmul(x_hat, a.t()), b.t())
            out = out + lora.to(out.dtype)
        return out
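The arithmetic in `apply` can be followed on a tiny example. The sketch below uses plain nested lists instead of tensors and replaces the NVFP4 GEMM with an exact matmul, so it only illustrates the decomposition, not quantization error. It also assumes the smoothed weight is `W / s` per input channel, which is what makes `X̂ @ W_sᵀ` equal `X @ Wᵀ`; that convention is inferred from the docstring and is not confirmed here. All helper names are hypothetical.

```python
def matmul_t(x, w):
    """Return x @ w^T for x of shape [n, k] and w of shape [m, k]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def svdquant_forward(x, s, residual, lora_a, lora_b):
    """Y = X_hat @ R^T + (X_hat @ L2^T) @ L1^T, with X_hat = X * s."""
    x_hat = [[v * scale for v, scale in zip(row, s)] for row in x]
    res = matmul_t(x_hat, residual)  # stands in for the NVFP4 residual GEMM
    lora = matmul_t(matmul_t(x_hat, lora_a), lora_b)  # rank-r correction
    return [[r + c for r, c in zip(res_row, lora_row)] for res_row, lora_row in zip(res, lora)]


w = [[2.0, 4.0], [6.0, 8.0]]  # original weight, [out=2, in=2]
s = [2.0, 4.0]                # pre_quant_scale, one entry per input channel
lora_b = [[1.0], [2.0]]       # L1, [out, r] with r = 1
lora_a = [[1.0, 1.0]]         # L2, [r, in]

# Smoothed weight (assumed convention) and the residual left after the low-rank part.
w_s = [[w_oi / s_i for w_oi, s_i in zip(row, s)] for row in w]
low_rank = [[sum(b * a for b, a in zip(b_row, a_col)) for a_col in zip(*lora_a)] for b_row in lora_b]
residual = [[ws - lr for ws, lr in zip(ws_row, lr_row)] for ws_row, lr_row in zip(w_s, low_rank)]

x = [[1.0, 1.0]]
y = svdquant_forward(x, s, residual, lora_a, lora_b)
reference = matmul_t(x, w)
print(y, reference)
```

With an exact residual the two results coincide; in the real path the residual is quantized to NVFP4, and the LoRA term absorbs the part of the weight that quantizes poorly.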
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Disable NCCL-window output for SVDQuant linears.
NVFP4SVDLinearMethod inherits supports_nccl_symmetric_memory_window_output=True, but Line 116 replaces the GEMM output with out + lora, so the returned tensor is no longer the NCCL-window buffer that Linear.forward() expects on the symmetric-memory all-reduce path. Override the class flag to False for this method, or fuse the LoRA add into the window buffer before returning.
Proposed fix
 class NVFP4SVDLinearMethod(NVFP4LinearMethod):
+    supports_nccl_symmetric_memory_window_output = False
+
     """SVDQuant: NVFP4 residual GEMM + rank-r BF16 LoRA correction.
🧰 Tools
🪛 Ruff (0.15.18)
[error] 101-101: Function argument input is shadowing a Python builtin
(A002)
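A small stand-alone sketch of the flag override follows. `BaseMethod` and `SVDMethod` are hypothetical stand-ins for `NVFP4LinearMethod` and `NVFP4SVDLinearMethod`, and floats stand in for tensors. One Python detail worth keeping in mind when applying a patch like this: a string literal placed after an assignment is no longer the class docstring, so the sketch puts the attribute below the docstring.

```python
class BaseMethod:
    """Hypothetical base method whose output is the window buffer itself."""

    supports_nccl_symmetric_memory_window_output = True

    def apply(self, x: float) -> float:
        return 2.0 * x  # pretend this is the GEMM result written into the buffer


class SVDMethod(BaseMethod):
    """Adds a correction term, so the returned value is a new object."""

    # Declared after the docstring so that SVDMethod.__doc__ is preserved.
    supports_nccl_symmetric_memory_window_output = False

    def apply(self, x: float) -> float:
        # out + correction allocates a fresh result rather than reusing the buffer.
        return super().apply(x) + 0.5 * x


result = SVDMethod().apply(2.0)
print(SVDMethod.supports_nccl_symmetric_memory_window_output, result)
```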
if "svdquant_lora_a" in w:
    module.svdquant_lora_a = nn.Parameter(
        w["svdquant_lora_a"].to(device), requires_grad=False)
    module.svdquant_lora_b = nn.Parameter(
        w["svdquant_lora_b"].to(device), requires_grad=False)
🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift
Shard SVDQuant LoRA factors for tensor-parallel linears.
The LoRA tensors are loaded directly from the checkpoint, but Linear stores local in_features/out_features under row/column TP. For row TP, svdquant_lora_a must be sharded on the input dimension; for column TP, svdquant_lora_b must be sharded on the output dimension. Otherwise the LoRA matmul can either shape-mismatch or add a global-output correction to a local residual. Please mirror the base NVFP4 sharding pattern when loading these factors.
Also applies to: 115-116
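The sharding rule can be checked on a toy LoRA term. This is only a sketch with nested lists in place of tensors: the helper names are hypothetical, the shard boundaries assume evenly divisible dimensions, and the concatenation and summation stand in for the all-gather and all-reduce a real tensor-parallel Linear would perform.

```python
def matmul_t(x, w):
    """Return x @ w^T for x of shape [n, k] and w of shape [m, k]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def lora(x, a, b):
    """(x @ a^T) @ b^T with a = [r, in] and b = [out, r]."""
    return matmul_t(matmul_t(x, a), b)


def shard_rows(m, rank, tp):
    n = len(m) // tp
    return m[rank * n:(rank + 1) * n]


def shard_cols(m, rank, tp):
    n = len(m[0]) // tp
    return [row[rank * n:(rank + 1) * n] for row in m]


x = [[1.0, 2.0, 3.0, 4.0]]        # smoothed activation, [n=1, in=4]
a = [[1.0, 0.0, 2.0, 1.0]]        # svdquant_lora_a, [r=1, in=4]
b = [[1.0], [2.0], [3.0], [4.0]]  # svdquant_lora_b, [out=4, r=1]
tp = 2
full = lora(x, a, b)[0]

# Column TP: shard lora_b on the output dimension, keep lora_a whole,
# then concatenate the per-rank outputs.
col_out = []
for rank in range(tp):
    col_out += lora(x, a, shard_rows(b, rank, tp))[0]

# Row TP: shard the activation and lora_a on the input dimension, keep lora_b
# whole, then sum the per-rank partial outputs.
partials = [
    lora(shard_cols(x, rank, tp), shard_cols(a, rank, tp), b)[0] for rank in range(tp)
]
row_out = [sum(values) for values in zip(*partials)]
print(full, col_out, row_out)
```

Under row TP the per-input-channel `pre_quant_scale` would need the same input-dimension shard, since it is applied to the local activation before the LoRA matmul.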
What does this PR do?
Type of change: New feature
Adds NVFP4 SVDQuant support for Qwen-Image in VisualGen: running from a
ModelOpt SVDQuant checkpoint (`quant_algo: NVFP4_SVD`), where each quantized
linear is `W ≈ R + L1·L2`: an NVFP4 residual `R` plus a per-input-channel
`pre_quant_scale` smoothing and a rank-r BF16 LoRA correction
(`svdquant_lora_a` = L2 `[r,in]`, `svdquant_lora_b` = L1 `[out,r]`).
Changes (the SVDQuant commit):
- `NVFP4SVDLinearMethod` (`transformer_qwen_image.py`): `Y = nvfp4_gemm(quant(X̂), R) + (X̂ @ L2ᵀ) @ L1ᵀ`, `X̂ = X·pre_quant_scale`. Subclasses the NVFP4 method for the residual; loads `pre_quant_scale` + the two LoRA factors per Linear.
- `load_weights` detects SVDQuant from `svdquant_lora_a` keys, swaps the method onto the quantized Linears, and relaxes the key check for the 3 extra tensors.
- `config.py`: `NVFP4_SVD → NVFP4` in `algo_map` so the residual loads on the standard static-NVFP4 path. Excluded layers (embedders/proj_out/first+last blocks) stay BF16, same as NVFP4.
- `examples/visual_gen/configs/qwen-image-svdquant-1gpu.yaml`; documented in `visual-generation.md`.
This is the functional path (BF16 matmuls for the LoRA). A fused FlashInfer
SVDQuant kernel for the residual+LoRA is a follow-up perf optimization.
Usage
Output samples (1328×1328, 50 steps, seed 42, same prompt)
SVDQuant quality is on par with NVFP4 and BF16:
Testing
On 1× GB200 (sm_100), TRT-LLM release container: the SVDQuant checkpoint loads
(729/729 transformer weights, no key/shape errors) and renders a coherent
1328² image visually on par with BF16/NVFP4 at the same prompt/seed. (The
residual-only output would be visibly degraded without the LoRA, so the clean
result confirms the LoRA term is applied.)