Arm backend: Add Qwen3 VL E2E coverage #20274

Open
bdemirb wants to merge 1 commit into pytorch:main from bdemirb:baris_mletorch-2068-qwen3-e2e-model-tests

@bdemirb bdemirb commented Jun 15, 2026

Collaborator

Add full Qwen3 VL language and vision model tests for the Arm backend in FP32 and BF16 modes.

Cover both TOSA and VGF no-quant paths, with BF16 VGF using an explicit FP profile that advertises BF16 support.

Relax the FP32 language-model tolerance to match observed TOSA reference drift for the full decoder stack.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

@bdemirb bdemirb requested a review from digantdesai as a code owner June 15, 2026 08:42
@pytorch-bot

pytorch-bot Bot commented Jun 15, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20274

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures, 3 Pending, 2 Unrelated Failures, 5 Unclassified Failures

As of commit 128a4bc with merge base 5563ee9 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 15, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 15, 2026


CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: bdemirb / name: Baris Demir (128a4bc)

@bdemirb

bdemirb commented Jun 15, 2026

Collaborator Author

@pytorchbot label "partner: arm"

@pytorch-bot pytorch-bot Bot added the partner: arm label Jun 15, 2026
@bdemirb

bdemirb commented Jun 15, 2026

Collaborator Author

@pytorchbot label "release notes: arm"

@bdemirb

bdemirb commented Jun 15, 2026

Collaborator Author

@claude review this PR

@pytorch-bot pytorch-bot Bot added the release notes: arm label Jun 15, 2026
@claude

claude Bot commented Jun 15, 2026


Claude finished @bdemirb's task in 3m 9s


Review: Qwen3 VL E2E coverage

  • Read the changed file
  • Check referenced helpers/config exist (qwen3_vl_test_config)
  • Compare against sibling Arm model tests for conventions
  • Verify pipeline API usage (tosa_extensions, tosa_spec, BF16 FP profile)
  • Post review feedback

Overall this is a clean, well-structured addition that mirrors the existing test_qwen3_vl_layers.py patterns and exercises the full text/vision stacks across TOSA FP, TOSA BF16, VGF, and VGF BF16. The API usage all checks out: tosa_extensions=["bf16"] and the explicit tosa_spec="TOSA-1.0+FP+bf16" are both valid paths through VgfPipeline/TosaPipelineFP. A few suggestions below — mostly maintainability, one correctness/flakiness concern.


1. Non-deterministic inputs + relaxed tolerances → likely flaky (most important)

All weights and inputs are random with no seed set anywhere (torch.randn, torch.randint in prepare_model_and_inputs). The PR description notes the FP32 text tolerance had to be relaxed to atol=3e-2 to "match observed TOSA reference drift" — but with unseeded random init across a full 28-layer decoder, the per-run error magnitude itself varies, so a tolerance tuned against one observed run can intermittently fail on another. None of the sibling model tests seed either, but they generally use much shorter op chains where drift is bounded.

Recommend seeding at the top of each prepare_model_and_inputs (or once at module import) to make the comparison reproducible, e.g. torch.manual_seed(0). That also lets you justify the exact tolerance value rather than padding it for the worst observed case.
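
The reproducibility argument can be illustrated with a toy, standard-library-only model. It is a sketch under stated assumptions, not the PR's test code: `simulated_final_error` and its Gaussian per-layer noise are hypothetical stand-ins for drift accumulating through a 28-layer stack. In the real tests the analogous fix would be the `torch.manual_seed(0)` call suggested above.

```python
import random


def simulated_final_error(num_layers=28, seed=None, per_layer_noise=1e-3):
    """Toy model (hypothetical): accumulate small random per-layer drift.

    seed=None mimics the unseeded case, where every run draws different values.
    """
    rng = random.Random(seed)
    error = 0.0
    for _ in range(num_layers):
        # Each layer adds a little signed noise to the running error.
        error += rng.gauss(0.0, per_layer_noise)
    return abs(error)


# Unseeded: the error magnitude differs from run to run, so a tolerance
# tuned against a single observed run may not hold on the next one.
unseeded_runs = [simulated_final_error() for _ in range(5)]

# Seeded: every run reproduces the same error, so a tolerance can be justified.
seeded_runs = [simulated_final_error(seed=0) for _ in range(5)]

print("unseeded:", unseeded_runs)
print("seeded:  ", seeded_runs)
```

The seeded list contains one repeated value, while the unseeded list varies per run; that variation is what can turn a fixed tolerance into an intermittent failure.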

2. Substantial duplication with test_qwen3_vl_layers.py

_make_qwen3_vl_2b_instruct_layer_config, _make_text_position_ids, _make_image_grid_thw, and _make_pixel_values are copied verbatim from test_qwen3_vl_layers.py (lines 39-75 there). _to_bfloat16_model_and_floating_inputs is also a near-identical copy of _to_bfloat16, and the Qwen3VLModelTestModule base class duplicates Qwen3VLTestModule. Since both files already share qwen3_vl_test_config.py, consider hoisting these shared helpers into that module (or a small qwen3_vl_test_common.py) and importing them in both test files. Per the repo's "no duplicated logic" guidance, this would be the highest-value cleanup.

3. Dead / inconsistent config fields on Qwen3VLModelTestCase

  • atol/rtol on the dataclass are only consumed in test_qwen3_vl_full_models_tosa_FP. The BF16 TOSA test hardcodes 1e-1, and both VGF tests ignore them entirely (no atol/rtol passed to VgfPipeline). Because the VGF cases set run_on_vulkan_runtime=False, there's no runtime numeric comparison there — so the fields being unused is harmless, but it's worth a one-line comment noting the VGF cases don't run on-device, otherwise a future reader will assume atol=1e-3 is being enforced.
  • run_on_vulkan_runtime defaults to True on the dataclass but is only ever meaningful for the VGF cases (which both set it False); it's silently ignored by the two TOSA tests. Minor, but the default invites confusion.
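
One way to make the unused fields explicit is sketched below. It is a hypothetical reshaping, not the PR's actual dataclass: only the field names `atol`, `rtol`, and `run_on_vulkan_runtime` come from the review above, while `ModelTestCaseSketch`, `name`, and `comparison_kwargs` are invented for illustration. The idea is that tolerances default to `None` (nothing enforced) and the runtime flag defaults to `False`, so a case states its numeric checks only when they apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelTestCaseSketch:
    """Hypothetical stand-in for Qwen3VLModelTestCase."""

    name: str
    # None means "no numeric comparison is configured for this case".
    atol: Optional[float] = None
    rtol: Optional[float] = None
    # Opt in explicitly; the TOSA-only cases never need this flag.
    run_on_vulkan_runtime: bool = False


def comparison_kwargs(case: ModelTestCaseSketch) -> dict:
    """Build only the tolerance kwargs that the case actually sets."""
    kwargs = {}
    if case.atol is not None:
        kwargs["atol"] = case.atol
    if case.rtol is not None:
        kwargs["rtol"] = case.rtol
    return kwargs


tosa_fp_case = ModelTestCaseSketch("text_model_tosa_fp", atol=3e-2)
vgf_case = ModelTestCaseSketch("text_model_vgf")

print(comparison_kwargs(tosa_fp_case))
print(comparison_kwargs(vgf_case))
```

With this shape, a reader of a VGF case sees no tolerance at all rather than a default value that is never enforced.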

4. The deepstack "keep-in-graph" trick is fragile

deepstack_residual = hidden_states.new_zeros(())
for deepstack_feature in deepstack_feature_lists:
    deepstack_residual = deepstack_residual + deepstack_feature.sum() * 0
return hidden_states + deepstack_residual

The intent (keep deepstack computation in the exported graph without affecting output) is clearly commented, which is good. The risk is that * 0 + adding a zero scalar is exactly the kind of pattern a constant-folding / DCE pass can eliminate, which would silently drop the deepstack subgraph you're trying to cover. If the goal is genuinely to exercise those ops through lowering, it's worth confirming the deepstack merger ops actually survive into the TOSA graph (e.g. via an exir_op/op-count check) rather than trusting the multiply-by-zero to stick. If they're already covered by the layer-level tests, a brief comment saying so would clarify why the no-op form is acceptable here.
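
The concern can be made concrete with a toy expression graph. This is a sketch only: `fold_zero_arithmetic` and `count_ops` are hypothetical helpers and do not model any actual ExecuTorch or Arm backend pass. Whether a real pass folds this pattern is exactly what the suggested op-count check would settle; for IEEE floats `x * 0` is not always `0` (NaN and infinity), so some compilers fold it and some do not.

```python
# Toy graph: a node is a tuple (op_name, *args); leaves are input names or numbers.


def fold_zero_arithmetic(node):
    """Hypothetical simplifier: rewrite mul(x, 0) -> 0 and add(x, 0) -> x."""
    if not isinstance(node, tuple):
        return node
    op, *args = node
    args = [fold_zero_arithmetic(arg) for arg in args]
    if op == "mul" and 0 in args:
        return 0
    if op == "add":
        non_zero = [arg for arg in args if arg != 0]
        if not non_zero:
            return 0
        if len(non_zero) == 1:
            return non_zero[0]
    return (op, *args)


def count_ops(node, op_name):
    """Count how many nodes named op_name appear in the graph."""
    if not isinstance(node, tuple):
        return 0
    op, *args = node
    return int(op == op_name) + sum(count_ops(arg, op_name) for arg in args)


# hidden_states + deepstack_merger(vision_features).sum() * 0
graph = (
    "add",
    "hidden_states",
    ("mul", ("sum", ("deepstack_merger", "vision_features")), 0),
)

before = count_ops(graph, "deepstack_merger")
after = count_ops(fold_zero_arithmetic(graph), "deepstack_merger")
print(before, after)
```

In this toy the merger op is present before folding and gone afterwards, which is the silent loss of coverage that an explicit op-count assertion on the lowered graph would catch.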

Minor

  • Qwen3_VL/ has no __init__.py, consistent with the sibling test_qwen3_vl_layers.py that imports from the same package, so collection should be fine — just flagging in case namespace-package collection ever bites.

Nothing here is blocking; (1) is the one I'd most want addressed before merge to avoid CI flakiness.

Add full Qwen3 VL language and vision model tests for the Arm
backend in FP32 and BF16 modes.

Cover both TOSA and VGF no-quant paths, with BF16 VGF
using an explicit FP profile that advertises BF16 support.

Relax the FP32 language-model tolerance to match observed TOSA
reference drift for the full decoder stack.

Signed-off-by: Baris Demir <baris.demir@arm.com>
Change-Id: I1fa6ffde632e0b252c4c19a95a854268ae01ba5a
@bdemirb bdemirb force-pushed the baris_mletorch-2068-qwen3-e2e-model-tests branch from e851ac7 to 128a4bc June 15, 2026 11:21

@zingo zingo left a comment

Collaborator

OK to merge if tests are OK. Just make sure no buck2-related changes are needed for the added files (you can ask Codex or Claude about it, or review the buck2 config files by hand).

Labels

  • ciflow/trunk
  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • module: arm (Issues related to arm backend)
  • partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
  • release notes: arm (Changes to the ARM backend delegate)

2 participants