Qualcomm AI Engine Direct - Support simple_eval in calibration, perpl… #12958


Merged
1 commit merged on Aug 4, 2025

Conversation

winskuo-quic
Collaborator

@winskuo-quic winskuo-quic commented Jul 29, 2025

Summary

  • Enable perplexity evaluation on device with llama.py.
  • Evaluate perplexity after QDQ on CPU.
  • Enable quantization to use simple_eval as the calibration dataset.
  • Enable a UT that checks perplexity for Qwen, which should be more reliable than checking the string output.

Will have a follow up PR to address:

  • External CI enablement for Qwen on x86 (if it does not take too long).
  • Hide logits scale/offset in model metadata.
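As background for the bullets above: the perplexity metric this PR evaluates is the exponential of the average negative log-likelihood per token. A minimal sketch, illustrative only — the helper name is hypothetical and is not code from this PR:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over evaluated tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to each of 4 tokens has
# perplexity exp(-log 0.25) = 4: it is as "surprised" as a uniform
# choice among 4 options.
print(round(perplexity([math.log(0.25)] * 4), 6))  # 4.0
```

Lower is better; a regression in quantization or on-device execution shows up directly as a higher score, which is why it is a more robust signal than string matching.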

Script

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext

Test plan

python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_static_qwen2_5 --model SM8650 --build_folder build-android/ --executorch_root . -s $DEVICE

Author: @shewu-quic, @winskuo-quic

@winskuo-quic winskuo-quic requested a review from cccclai as a code owner July 29, 2025 15:20

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12958


❌ 3 New Failures

As of commit 7a1a1d3 with merge base 9e00a51:

NEW FAILURES - The following jobs have failed:


@facebook-github-bot facebook-github-bot added the CLA Signed label on Jul 29, 2025.

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@winskuo-quic
Collaborator Author

Hi @cccclai,
We are interested in using a more robust way to evaluate the accuracy of QNN models.
Previously, we only ensured that the first few characters matched the golden output, which is rather unsafe and unstable.
This PR evaluates the e2e flow across three phases: prepare_pt2e (CPU fp32), convert_pt2e (CPU QDQ), and QNN on device.
At the end, we use the QNN ppl score to check whether there is a regression.
We would like to add this test to ExecuTorch's CI, using the x86 emulator. However, this could take a couple of hours (estimated 3-6 hours) to finish on the x86 emulator. Do you know if there are any concerns with the CI time limit?
Thanks
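The regression gate described in this comment could be sketched roughly as follows. This is a hypothetical illustration with made-up names and thresholds, not the actual test code in test_qnn_delegate.py:

```python
def check_ppl_regression(ppl_fp32, ppl_qdq, ppl_device, max_ratio=2.0):
    """Flag a regression if on-device perplexity drifts too far from CPU QDQ.

    The three inputs correspond to the three phases evaluated by the PR:
    prepare_pt2e (CPU fp32), convert_pt2e (CPU QDQ), and QNN on device.
    """
    if ppl_device > ppl_qdq * max_ratio:
        raise AssertionError(
            f"on-device ppl {ppl_device:.2f} regressed vs qdq {ppl_qdq:.2f}"
        )
    # Return the degradation factor relative to the fp32 baseline.
    return ppl_device / ppl_fp32

# Example figures similar to those reported later in the thread
# (fp32/qdq around 12, on-device around 19).
print(round(check_ppl_regression(12.0, 12.0, 19.0), 3))  # 1.583
```

Gating on a ratio rather than an absolute score keeps the check stable across model and dataset changes, while still catching on-device drift.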

@cccclai
Contributor

cccclai commented Jul 30, 2025

> We are interested in using a more robust way to evaluate the accuracy of QNN models. […] Do you know if there are any concerns with the CI time limit?

There are some jobs that run less frequently, like https://github.com/pytorch/executorch/blob/b6b7a16df5e7852d976d8c34c8a7e9a1b6f7d005/.github/workflows/periodic.yml and https://github.com/pytorch/executorch/blob/b6b7a16df5e7852d976d8c34c8a7e9a1b6f7d005/.github/workflows/nightly.yml; they won't block CI. Maybe we can use these jobs.

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from 4f7d594 to dd3b173 Compare July 31, 2025 03:21
@winskuo-quic
Collaborator Author

> There are some jobs running less frequently, like periodic.yml and nightly.yml; they won't block CI. Maybe we can use these jobs.

Sure! We will add the CI under these yml files in a future PR. Thanks!

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from dd3b173 to 1cd8022 Compare July 31, 2025 04:46
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79354902.

@haowhsu-quic
Collaborator

haowhsu-quic commented Aug 2, 2025

Hi @cccclai,

There are some incoming PRs we would like to let you know about:

  • 16-bit KV cache IO in runtime.
  • Documentation of llama3_1B/3B and qwen2_0.5B/1.5B e2e with instructions & metrics (we've verified them and got pretty good results).
  • Model enablement for qwen3_0.6B/1.7B, smollm2_135m_instruct, phi4_mini_instruct, and gemma3 (accuracy will be addressed in other PRs).

Thank you.

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from bb81d56 to f0e16d1 Compare August 2, 2025 07:07
@winskuo-quic
Collaborator Author

winskuo-quic commented Aug 2, 2025

Hi @cccclai,
This PR is now ready to review. Thanks!
For this PR, with the following script, we expect a ppl score of about 19 for QNN on device, which is slightly higher than prepare_pt2e and convert_pt2e (both around 12). With this PR, you should see some reasonable outputs, though still not the best. The token rate should be around 125-130 tok/sec on SM8750.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 1

After this PR is merged, we will push another PR that applies some optimizations. With those optimizations, we should be able to get a ppl score of 12 for QNN on device, aligning with prepare_pt2e and convert_pt2e.
Thanks.

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from 43ce5b2 to 7a1a1d3 Compare August 3, 2025 01:53
Contributor

@cccclai cccclai left a comment

Looks great, thank you!

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79354902.

@cccclai cccclai merged commit c1dba0f into pytorch:main Aug 4, 2025
100 of 105 checks passed