Qualcomm AI Engine Direct - Support simple_eval in calibration, perpl… #12958


Merged
1 commit merged on Aug 4, 2025

Conversation

winskuo-quic
Collaborator

@winskuo-quic winskuo-quic commented Jul 29, 2025

Summary

  • Enable perplexity evaluation on device with llama.py.
  • Evaluate perplexity after QDQ on CPU.
  • Enable quantization to use simple_eval as the calibration dataset.
  • Enable a UT that checks perplexity for Qwen, which should be more reliable than checking the string output.

Will have a follow up PR to address:

  • External CI enablement for Qwen on x86 (if it does not take too long).
  • Hide logits scale/offset in model metadata.
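As background for the bullets above: the perplexity metric this PR evaluates is the exponential of the average negative log-likelihood per token. A minimal sketch, illustrative only — the helper name is hypothetical and is not code from this PR:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over evaluated tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model assigning probability 0.25 to each of 4 tokens has
# perplexity exp(-log 0.25) = 4: it is as "surprised" as a uniform
# choice among 4 options.
print(round(perplexity([math.log(0.25)] * 4), 6))  # 4.0
```

Lower is better; a regression in quantization or on-device execution shows up directly as a higher score, which is why it is a more robust signal than string matching.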

Script

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext

Test plan

python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_static_qwen2_5 --model SM8650 --build_folder build-android/ --executorch_root . -s $DEVICE

Author: @shewu-quic, @winskuo-quic

@winskuo-quic winskuo-quic requested a review from cccclai as a code owner July 29, 2025 15:20

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12958


❌ 3 New Failures

As of commit 7a1a1d3 with merge base 9e00a51:

NEW FAILURES - The following jobs have failed:


@facebook-github-bot facebook-github-bot added the CLA Signed label on Jul 29, 2025.

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@winskuo-quic
Collaborator Author

Hi @cccclai,
We are interested in using a more robust way to evaluate the accuracy of QNN models.
Previously, we only ensured that the first few characters matched the golden output, which is rather unsafe and unstable.
This PR evaluates the e2e flow across three phases: prepare_pt2e (CPU fp32), convert_pt2e (CPU QDQ), and QNN on device.
At the end, we use the QNN ppl score to check whether there is a regression.
We would like to add this test to ExecuTorch's CI, using the x86 emulator. However, this could take a couple of hours (estimated 3-6 hours) to finish on the x86 emulator. Do you know if there are any concerns with the CI time limit?
Thanks
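The regression gate described in this comment could be sketched roughly as follows. This is a hypothetical illustration with made-up names and thresholds, not the actual test code in test_qnn_delegate.py:

```python
def check_ppl_regression(ppl_fp32, ppl_qdq, ppl_device, max_ratio=2.0):
    """Flag a regression if on-device perplexity drifts too far from CPU QDQ.

    The three inputs correspond to the three phases evaluated by the PR:
    prepare_pt2e (CPU fp32), convert_pt2e (CPU QDQ), and QNN on device.
    """
    if ppl_device > ppl_qdq * max_ratio:
        raise AssertionError(
            f"on-device ppl {ppl_device:.2f} regressed vs qdq {ppl_qdq:.2f}"
        )
    # Return the degradation factor relative to the fp32 baseline.
    return ppl_device / ppl_fp32

# Example figures similar to those reported later in the thread
# (fp32/qdq around 12, on-device around 19).
print(round(check_ppl_regression(12.0, 12.0, 19.0), 3))  # 1.583
```

Gating on a ratio rather than an absolute score keeps the check stable across model and dataset changes, while still catching on-device drift.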

@cccclai
Contributor

cccclai commented Jul 30, 2025

> We are interested in using a more robust way to evaluate the accuracy of QNN models. […] Do you know if there are any concerns with the CI time limit?

There are some jobs that run less frequently, like https://github.com/pytorch/executorch/blob/b6b7a16df5e7852d976d8c34c8a7e9a1b6f7d005/.github/workflows/periodic.yml and https://github.com/pytorch/executorch/blob/b6b7a16df5e7852d976d8c34c8a7e9a1b6f7d005/.github/workflows/nightly.yml; they won't block CI. Maybe we can use these jobs.

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from 4f7d594 to dd3b173 Compare July 31, 2025 03:21
@winskuo-quic
Collaborator Author

> There are some jobs running less frequently, like periodic.yml and nightly.yml; they won't block CI. Maybe we can use these jobs.

Sure! We will add the CI under these yml files in a future PR. Thanks!

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from dd3b173 to 1cd8022 Compare July 31, 2025 04:46
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79354902.

@haowhsu-quic
Collaborator

haowhsu-quic commented Aug 2, 2025

Hi @cccclai,

There are some incoming PRs we would like to let you know about:

  • 16-bit KV cache IO in runtime.
  • Documentation of llama3_1B/3B and qwen2_0.5B/1.5B e2e with instructions & metrics (we've verified them and got pretty good results).
  • Model enablement for qwen3_0.6B/1.7B, smollm2_135m_instruct, phi4_mini_instruct, and gemma3 (accuracy will be addressed in other PRs).

Thank you.

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from bb81d56 to f0e16d1 Compare August 2, 2025 07:07
@winskuo-quic
Collaborator Author

winskuo-quic commented Aug 2, 2025

Hi @cccclai,
This PR is now ready to review. Thanks!
For this PR, with the following script, we expect a ppl score of about 19 for QNN on device, which is slightly higher than prepare_pt2e and convert_pt2e (both around 12). With this PR, you should see some reasonable outputs, though still not the best. The token rate should be around 125-130 tok/sec on SM8750.
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --prompt "What is 1+1?" --temperature 0 --model_mode kv --max_seq_len 1024 --ptq 16a8w --decoder_model qwen2_5 --eval_perplexity --tasks wikitext --limit 1

After this PR is merged, we will push another PR that applies some optimizations. With those optimizations, we should be able to get a ppl score of 12 for QNN on device, aligning with prepare_pt2e and convert_pt2e.
Thanks.

@winskuo-quic winskuo-quic force-pushed the dev1/winskuo/device_eval_ppl branch from 43ce5b2 to 7a1a1d3 Compare August 3, 2025 01:53
Contributor

@cccclai cccclai left a comment

Looks great, thank you!

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D79354902.

@cccclai cccclai merged commit c1dba0f into pytorch:main Aug 4, 2025
100 of 105 checks passed