Arm backend: Add Qwen3 VL E2E coverage by bdemirb · Pull Request #20274 · pytorch/executorch

bdemirb · 2026-06-15T08:42:56Z

Add full Qwen3 VL language and vision model tests for the Arm backend in FP32 and BF16 modes.

Cover both TOSA and VGF no-quant paths, with BF16 VGF using an explicit FP profile that advertises BF16 support.

Relax the FP32 language-model tolerance to match observed TOSA reference drift for the full decoder stack.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

pytorch-bot · 2026-06-15T08:42:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20274

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 6 New Failures, 3 Pending, 2 Unrelated Failures, 5 Unclassified Failures

As of commit 128a4bc with merge base 5563ee9:

NEW FAILURES - The following jobs have failed:

pull / test-mcu-cortex-m-backend / linux-job (gh)
RuntimeError: Command docker exec -t 330f8125824bef99bf09d38eefb31542f291f6e2964af0ca1a74b9dc7eec0a39 /exec failed with exit code 1
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t 3b9f494197524ed632168a81a916036aa0ae5324321325e6994e214a133aab5c /exec failed with exit code 139
pull / test-phi-3-mini-runner-linux / linux-job (gh)
RuntimeError: Command docker exec -t d8e31086ed85f5939857c7cb33fcdcaeb275c0f76611e9e3b90767d9e27af195 /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
export/tests/test_target_recipes.py::TestTargetRecipes::test_resnet18_model
trunk / test-models-macos-coreml (ic4) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-coreml (vit) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Cadence Build & Test / vision-build / vision (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Input required and not supplied: aws-region
trunk / test-models-macos-cpu (mobilebert, xnnpack-quantization-delegation) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-models-macos-cpu (mv3, portable) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: operator torchvision::nms does not exist
trunk / test-models-macos-cpu (mv3, xnnpack-quantization-delegation) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
trunk / test-selective-build-macos (cmake) / macos-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

Cadence Build & Test / hifi-build / hifi4 (gh) (detected as infra flaky with no log or failing log classifier)
pull / test-arm-backend-no-driver (test_pytest_models_tosa) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-06-15T08:43:02Z

The committers listed above are authorized under a signed CLA.

✅ login: bdemirb / name: Baris Demir (128a4bc)

bdemirb · 2026-06-15T08:44:37Z

@pytorchbot label "partner: arm"

bdemirb · 2026-06-15T08:44:43Z

@pytorchbot label "release notes: arm"

bdemirb · 2026-06-15T08:44:51Z

@claude review this PR

claude · 2026-06-15T08:45:44Z

Claude finished @bdemirb's task in 3m 9s

Review: Qwen3 VL E2E coverage

Read the changed file
Check referenced helpers/config exist (qwen3_vl_test_config)
Compare against sibling Arm model tests for conventions
Verify pipeline API usage (tosa_extensions, tosa_spec, BF16 FP profile)
Post review feedback

Overall this is a clean, well-structured addition that mirrors the existing test_qwen3_vl_layers.py patterns and exercises the full text/vision stacks across TOSA FP, TOSA BF16, VGF, and VGF BF16. The API usage all checks out: tosa_extensions=["bf16"] and the explicit tosa_spec="TOSA-1.0+FP+bf16" are both valid paths through VgfPipeline/TosaPipelineFP. A few suggestions below — mostly maintainability, one correctness/flakiness concern.

1. Non-deterministic inputs + relaxed tolerances → likely flaky (most important)

All weights and inputs are random with no seed set anywhere (torch.randn, torch.randint in prepare_model_and_inputs). The PR description notes the FP32 text tolerance had to be relaxed to atol=3e-2 to "match observed TOSA reference drift" — but with unseeded random init across a full 28-layer decoder, the per-run error magnitude itself varies, so a tolerance tuned against one observed run can intermittently fail on another. None of the sibling model tests seed either, but they generally use much shorter op chains where drift is bounded.

Recommend seeding at the top of each prepare_model_and_inputs (or once at module import) to make the comparison reproducible, e.g. torch.manual_seed(0). That also lets you justify the exact tolerance value rather than padding it for the worst observed case.
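The reproducibility argument in point 1 can be illustrated with a stdlib-only sketch. This is not the PR's code: `max_drift` is a hypothetical stand-in that accumulates a small random error per layer, and the noise scale of 1e-3 is an arbitrary assumption. In the actual tests the equivalent step would be the `torch.manual_seed(0)` call suggested above.

```python
import random


def max_drift(num_layers: int, rng: random.Random) -> float:
    """Hypothetical model of drift: sum a small random error per layer
    and report the largest absolute error seen along the way."""
    error = 0.0
    worst = 0.0
    for _ in range(num_layers):
        error += rng.gauss(0.0, 1e-3)  # assumed per-layer noise scale
        worst = max(worst, abs(error))
    return worst


# Unseeded: each Random() is seeded from system entropy, so the drift
# generally differs from run to run.
unseeded = [max_drift(28, random.Random()) for _ in range(3)]

# Seeded: the same seed reproduces the same drift every time, so a
# tolerance can be justified against one fixed, repeatable value.
seeded = [max_drift(28, random.Random(0)) for _ in range(3)]

print("unseeded:", unseeded)
print("seeded:  ", seeded)
```

With a fixed seed the three seeded values are identical, which is the property that makes a tight tolerance defensible.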

2. Substantial duplication with `test_qwen3_vl_layers.py`

_make_qwen3_vl_2b_instruct_layer_config, _make_text_position_ids, _make_image_grid_thw, and _make_pixel_values are copied verbatim from test_qwen3_vl_layers.py (lines 39-75 there). _to_bfloat16_model_and_floating_inputs is also a near-identical copy of _to_bfloat16, and the Qwen3VLModelTestModule base class duplicates Qwen3VLTestModule. Since both files already share qwen3_vl_test_config.py, consider hoisting these shared helpers into that module (or a small qwen3_vl_test_common.py) and importing them in both test files. Per the repo's "no duplicated logic" guidance, this would be the highest-value cleanup.

3. Dead / inconsistent config fields on `Qwen3VLModelTestCase`

atol/rtol on the dataclass are only consumed in test_qwen3_vl_full_models_tosa_FP. The BF16 TOSA test hardcodes 1e-1, and both VGF tests ignore them entirely (no atol/rtol passed to VgfPipeline). Because the VGF cases set run_on_vulkan_runtime=False, there's no runtime numeric comparison there — so the fields being unused is harmless, but it's worth a one-line comment noting the VGF cases don't run on-device, otherwise a future reader will assume atol=1e-3 is being enforced.
run_on_vulkan_runtime defaults to True on the dataclass but is only ever meaningful for the VGF cases (which both set it False); it's silently ignored by the two TOSA tests. Minor, but the default invites confusion.
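One way to make unused fields explicit is sketched below, assuming a simplified dataclass. `ModelTestCase` and `tolerance_kwargs` are hypothetical names, not the PR's `Qwen3VLModelTestCase`; only the `atol=3e-2` value for the FP32 text case comes from the discussion above. Tolerances default to `None`, meaning "no numeric comparison requested", and `run_on_vulkan_runtime` defaults to `False` so that on-device execution is opt-in.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ModelTestCase:
    """Hypothetical stand-in for the PR's test-case dataclass."""

    name: str
    atol: Optional[float] = None  # None: no tolerance enforced for this case
    rtol: Optional[float] = None
    run_on_vulkan_runtime: bool = False  # opt in explicitly for VGF cases


def tolerance_kwargs(case: ModelTestCase) -> Dict[str, Any]:
    """Return only the tolerances the case sets explicitly, so a pipeline
    call never receives a value that looks enforced but is not."""
    kwargs: Dict[str, Any] = {}
    if case.atol is not None:
        kwargs["atol"] = case.atol
    if case.rtol is not None:
        kwargs["rtol"] = case.rtol
    return kwargs


text_fp32 = ModelTestCase("text_fp32", atol=3e-2)
vgf_bf16 = ModelTestCase("vgf_bf16")  # lowering only, no numeric check

print(tolerance_kwargs(text_fp32))
print(tolerance_kwargs(vgf_bf16))
```

With this shape, a reader can tell from the case definition alone whether a tolerance applies.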

4. The deepstack "keep-in-graph" trick is fragile

deepstack_residual = hidden_states.new_zeros(())
for deepstack_feature in deepstack_feature_lists:
    deepstack_residual = deepstack_residual + deepstack_feature.sum() * 0
return hidden_states + deepstack_residual

The intent (keep deepstack computation in the exported graph without affecting output) is clearly commented, which is good. The risk is that * 0 + adding a zero scalar is exactly the kind of pattern a constant-folding / DCE pass can eliminate, which would silently drop the deepstack subgraph you're trying to cover. If the goal is genuinely to exercise those ops through lowering, it's worth confirming the deepstack merger ops actually survive into the TOSA graph (e.g. via an exir_op/op-count check) rather than trusting the multiply-by-zero to stick. If they're already covered by the layer-level tests, a brief comment saying so would clarify why the no-op form is acceptable here.
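The concern can be shown on a toy expression graph. This is a stdlib-only illustration under stated assumptions: `Node`, `simplify`, and `count_ops` are hypothetical, the `"merger"` op is a placeholder for the deepstack subgraph, and whether any real ExecuTorch or Arm backend pass performs this rewrite is not confirmed here. The sketch only demonstrates that an algebraic simplifier applying `x * 0 -> 0` and `x + 0 -> x` removes the subgraph, and that an op-count check detects it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    op: str  # e.g. "input", "const", "add", "mul", "sum", "merger"
    args: Tuple["Node", ...] = ()
    value: float = 0.0  # only meaningful for "const" nodes


def is_zero(node: Node) -> bool:
    return node.op == "const" and node.value == 0.0


def simplify(node: Node) -> Node:
    """Toy simplifier: rewrites x * 0 -> 0 and x + 0 -> x, bottom-up."""
    args = tuple(simplify(a) for a in node.args)
    if node.op == "mul" and any(is_zero(a) for a in args):
        return Node("const", value=0.0)
    if node.op == "add":
        non_zero = [a for a in args if not is_zero(a)]
        if not non_zero:
            return Node("const", value=0.0)
        if len(non_zero) == 1:
            return non_zero[0]
    return Node(node.op, args, node.value)


def count_ops(node: Node, op: str) -> int:
    """Count how many nodes with the given op are reachable from node."""
    return int(node.op == op) + sum(count_ops(a, op) for a in node.args)


# Mirror the quoted pattern: hidden_states + (feature.sum() * 0).
hidden = Node("input")
feature = Node("merger", (Node("input"),))
residual = Node("add", (Node("const"), Node("mul", (Node("sum", (feature,)), Node("const")))))
out = Node("add", (hidden, residual))

print("merger ops before:", count_ops(out, "merger"))
print("merger ops after: ", count_ops(simplify(out), "merger"))
```

An equivalent check on the real lowered graph would count the deepstack merger ops after lowering and fail the test if the count is zero.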

Minor

Qwen3_VL/ has no __init__.py, consistent with the sibling test_qwen3_vl_layers.py that imports from the same package, so collection should be fine — just flagging in case namespace-package collection ever bites.

Nothing here is blocking; (1) is the one I'd most want addressed before merge to avoid CI flakiness.

Add full Qwen3 VL language and vision model tests for the Arm backend in FP32 and BF16 modes.

Cover both TOSA and VGF no-quant paths, with BF16 VGF using an explicit FP profile that advertises BF16 support.

Relax the FP32 language-model tolerance to match observed TOSA reference drift for the full decoder stack.

Signed-off-by: Baris Demir <baris.demir@arm.com>
Change-Id: I1fa6ffde632e0b252c4c19a95a854268ae01ba5a

zingo

OK to merge if the tests pass. Just make sure no Buck2-related changes are needed for the added files (you can ask Codex or Claude about it, or review the Buck2 config files by hand).

bdemirb requested a review from digantdesai as a code owner June 15, 2026 08:42

meta-cla Bot added the CLA Signed label Jun 15, 2026

bdemirb had a problem deploying to cadence June 15, 2026 08:43 — with GitHub Actions Failure

github-actions Bot added the ciflow/trunk and module: arm labels Jun 15, 2026

pytorch-bot Bot added the partner: arm label Jun 15, 2026

pytorch-bot Bot added the release notes: arm label Jun 15, 2026

bdemirb force-pushed the baris_mletorch-2068-qwen3-e2e-model-tests branch from e851ac7 to 128a4bc June 15, 2026 11:21

bdemirb had a problem deploying to cadence June 15, 2026 11:21 — with GitHub Actions Failure

zingo approved these changes Jun 15, 2026

bdemirb wants to merge 1 commit into pytorch:main from bdemirb:baris_mletorch-2068-qwen3-e2e-model-tests

2 participants
