
[EXIR] Register _clone_dim_order op and map aten.clone #12971


Draft: keyprocedure wants to merge 6 commits into main

Conversation

Contributor

@keyprocedure keyprocedure commented Jul 29, 2025

Summary

This is PR 2 of 3 implementing a dim order aware clone op.

This PR registers the new _clone_dim_order op and maps aten.clone ops to dim_order_ops._clone_dim_order during export to preserve memory layout changes (contiguous/channels_last).
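For context, here is a minimal sketch of the two pieces described above, modeled on the torch.library registration style used for the existing _to_dim_order_copy op. The schema and pass wiring are illustrative assumptions, not the PR's literal code:

```python
import torch
from torch.library import Library, impl

# Hypothetical schema, modeled on the existing _to_dim_order_copy op;
# the real definition lives in this PR.
lib = Library("dim_order_ops", "DEF")
lib.define(
    "_clone_dim_order(Tensor self, *, bool non_blocking=False, int[]? dim_order=None) -> Tensor"
)

@impl(lib, "_clone_dim_order", "CompositeImplicitAutograd")
def _clone_dim_order_impl(self, *, non_blocking=False, dim_order=None):
    # Python fallback: reuse ATen clone so the op can be traced and run at
    # export time even before a runtime kernel exists. This sketch ignores
    # dim_order; the real fallback derives the output layout from it.
    return self.clone()

def replace_aten_clone(graph_module: torch.fx.GraphModule) -> None:
    # Illustrative mapping step: rewrite aten.clone nodes to the
    # dim-order-aware variant during export.
    for node in graph_module.graph.nodes:
        if node.op == "call_function" and node.target == torch.ops.aten.clone.default:
            node.target = torch.ops.dim_order_ops._clone_dim_order.default
    graph_module.recompile()
```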

Related PRs:

  • PR 1: #12974 - Add _clone_dim_order portable kernel
  • PR 3: #12976 - Update RemoveCloneOpsTransform to be dim order aware

Fixes #12645

Test plan

  • Operator-level tests to verify kernel behavior for layout preservation and conversion.
  • Graph-level checks to confirm that the clone mapping occurs.
  • End-to-end tests to validate that functional clone behavior is unchanged.

All tests pass via:
python -m unittest exir.tests.test_memory_format_ops_pass
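For illustration, the graph-level check in the test plan might look like this minimal sketch (it assumes the default edge compile config, i.e. skip_dim_order=False):

```python
import torch
from torch.export import export
from executorch.exir import to_edge

class CloneModule(torch.nn.Module):
    def forward(self, x):
        # clone that converts a contiguous tensor to channels_last
        return x.clone(memory_format=torch.channels_last)

ep = export(CloneModule(), (torch.rand(1, 2, 3, 4),))
edge = to_edge(ep)
code = edge.exported_program().graph_module.code

# After this PR, the edge graph should carry the dim order variant
# instead of the plain aten clone.
assert "_clone_dim_order" in code
assert "aten_clone" not in code
```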


pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12971

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 29, 2025
Contributor

@Gasoonjia Gasoonjia left a comment


LGTM! Left minor feedback about test coverage.

@@ -389,3 +390,17 @@ def test_mobilenet_v3_xnnpack(self) -> None:
rtol=1e-3,
),
)

def test_op_clone_dim_order_registration(self):
Contributor


Can you follow the pattern of the other test cases (e.g., test_op_dim_order_propagation) and reuse the memory_format_test_runner function to cover at least two scenarios: one that keeps the dim order and one that converts it?
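For reference, the two requested scenarios could be driven by modules along these lines (the runner's exact signature isn't shown in this thread, so the harness wiring is omitted):

```python
import torch

class CloneKeepsDimOrder(torch.nn.Module):
    # scenario 1: clone preserves the input's dim order
    def forward(self, x):
        return x.clone()

class CloneConvertsDimOrder(torch.nn.Module):
    # scenario 2: clone converts a contiguous input to channels_last
    def forward(self, x):
        return x.clone(memory_format=torch.channels_last)
```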

Contributor Author


I can definitely add more tests; the reason I didn't here is that the kernel implementation isn't in place in this PR. I'll work on adding additional coverage.

@keyprocedure
Contributor Author

@pytorchbot label "release notes: none"

@pytorch-bot pytorch-bot bot added the release notes: none Do not include this in the release notes label Jul 29, 2025
@keyprocedure
Contributor Author

@Gasoonjia splitting the clone_dim_order op implementation into 3 PRs was a great suggestion; it makes it much easier to isolate and debug issues.

The CI failures are mainly due to two issues:

  1. clone is now mapped to clone_dim_order, so tests expecting executorch_exir_dialects_edge__ops_aten_clone_default are failing because they now see executorch_exir_dialects_edge__ops_dim_order_ops__clone_dim_order_default.
  2. Since clone is now mapped to clone_dim_order, the lack of a kernel implementation for clone_dim_order in this PR also triggers runtime failures.

Do you have any suggestions for how we should handle the current approach of mapping clone to clone_dim_order?

@Gasoonjia
Contributor

Hi @keyprocedure, thanks for your update!

For the places that check executorch_exir_dialects_edge__ops_aten_clone_default, please update them to expect executorch_exir_dialects_edge__ops_dim_order_ops__clone_dim_order_default, as long as compile_config.skip_dim_order == False in their pipeline.
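Concretely, an affected test's expectation would change roughly like this (FileCheck shown only as one way to assert on the graph text):

```python
from torch.testing import FileCheck

def check_clone_mapping(graph_code: str) -> None:
    # graph_code: the edge program's graph_module.code.
    # Old expectation, which now fails:
    #   executorch_exir_dialects_edge__ops_aten_clone_default
    # New expectation when compile_config.skip_dim_order == False:
    FileCheck().check(
        "executorch_exir_dialects_edge__ops_dim_order_ops__clone_dim_order_default"
    ).run(graph_code)
```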

the lack of a kernel implementation for clone_dim_order

Can you share more insight here? I thought you'd created the clone_dim_order kernel on the AOT side by reusing the ATen clone op.

@keyprocedure
Contributor Author

keyprocedure commented Jul 31, 2025

@Gasoonjia during export, _clone_dim_order does use ATen's clone implementation via the Python fallback, but at runtime the graph contains _clone_dim_order, and since no kernel is registered for it, it fails with:
[operator_registry.cpp:256] kernel 'dim_order_ops::_clone_dim_order.out' not found.

This is also why I placed the tests that use memory_format_test_runner in the kernel PR instead of this one.

What are your thoughts?

@Gasoonjia
Contributor

@keyprocedure thanks for the detailed reply!

Let's swap the order of the PRs: first finish the runtime op, then the AOT replacement, and finally RemoveCloneOpsTransform. That would solve the runtime operator requirement issue.

@keyprocedure
Contributor Author

@keyprocedure thanks for the detailed reply!

Let's swap the order of the PRs: first finish the runtime op, then the AOT replacement, and finally RemoveCloneOpsTransform. That would solve the runtime operator requirement issue.

That's a great idea! Should I move the tests from the runtime/kernel PR (since the new op won't be registered yet) to the AOT replacement PR?

@Gasoonjia
Contributor

Gasoonjia commented Jul 31, 2025

@keyprocedure thanks for the detailed reply!
Let's swap the order of the PRs: first finish the runtime op, then the AOT replacement, and finally RemoveCloneOpsTransform. That would solve the runtime operator requirement issue.

That's a great idea! Should I move the tests from the runtime/kernel PR (since the new op won't be registered yet) to the AOT replacement PR?

Yes please! More specifically:

The runtime PR should contain only:

  1. the portable clone_dim_order op
  2. the operator test itself

The AOT PR should contain both the graph-level checks and the end-to-end test using pybindings.
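A rough shape for that end-to-end pybindings test, sketched under the assumption of the portable pybindings module and the default export flow (the module and input shapes are placeholders):

```python
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.extension.pybindings.portable_lib import (
    _load_for_executorch_from_buffer,
)

class CloneModule(torch.nn.Module):
    def forward(self, x):
        return x.clone(memory_format=torch.channels_last)

x = torch.rand(1, 2, 3, 4)
program = to_edge(export(CloneModule(), (x,))).to_executorch().buffer

# Loading and running the serialized program exercises exactly the path
# that fails with "kernel 'dim_order_ops::_clone_dim_order.out' not found"
# until the kernel PR lands.
module = _load_for_executorch_from_buffer(program)
out = module.forward([x])[0]
torch.testing.assert_close(out, x.clone(memory_format=torch.channels_last))
```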

@keyprocedure keyprocedure changed the title [EXIR] Add dim order aware clone operator [EXIR] Map clone to _clone_dim_order Aug 3, 2025
@keyprocedure
Contributor Author

keyprocedure commented Aug 4, 2025

Hi @Gasoonjia

I've updated the PRs; the kernel PR is ready for review.

The CI failures from this PR that weren't due to the missing _clone_dim_order kernel are all ARM-related runtime failures:

  • Expected to not find "torch.ops.higher_order.executorch_call_delegate" but found it
  • Expected to find "torch.ops.higher_order.executorch_call_delegate" but did not find it
  • Expected out tensor to have dtype signed char, but got float instead
  • Expected to find "executorch_exir_dialects_edge__ops_aten_clone_default" but did not find it

These failures seem to share the same root cause: aten.clone being replaced by _clone_dim_order. But since they're spread across quite a few tests, I'm not sure which strategy we should use to address them. Does it make sense to correct each failing test individually, or is there a way to gate the ARM backend so it still uses the aten.clone op?

Any insights or guidance would be helpful.
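Purely to illustrate the gating idea (not an agreed fix), an ARM-side pass could rewrite the new op back before partitioning; everything here is hypothetical:

```python
import torch

def revert_clone_dim_order(graph_module: torch.fx.GraphModule) -> None:
    # Hypothetical gate: rewrite _clone_dim_order back to aten.clone so the
    # existing TOSA lowering and tests see the op they expect. Note that at
    # the edge dialect level the targets would be exir_ops.edge.* instead.
    for node in graph_module.graph.nodes:
        if (
            node.op == "call_function"
            and node.target == torch.ops.dim_order_ops._clone_dim_order.default
        ):
            node.target = torch.ops.aten.clone.default
            # drop the dim_order kwarg, which aten.clone doesn't accept
            node.kwargs = {k: v for k, v in node.kwargs.items() if k != "dim_order"}
    graph_module.recompile()
```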

@Gasoonjia
Contributor

Gasoonjia commented Aug 4, 2025

@keyprocedure
Thanks for sharing the error messages!
I've left some comments on the kernel PR, please take a look.

For the issue in this PR, can you share a code pointer to it, i.e., which line raises the error?

@keyprocedure
Contributor Author

There were a total of 49 failing tests in the initial CI run, so I've listed the test and line number of each failure:

unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job

19 failures

backends/arm/test/models/stable_diffusion/test_T5EncoderModel.py:100:
    TestT5EncoderModel.test_T5EncoderModel_tosa_BI

"AssertionError: Invalid partition, found dependency cycles"
backends/arm/test/models/stable_diffusion/test_T5EncoderModel.py:80:
    TestT5EncoderModel.test_T5EncoderModel_tosa_MI

"IndexError: list index out of range"
backends/arm/test/models/stable_diffusion/test_CLIPTextModelWithProjection.py:76:
    TestCLIPTextModelWithProjection.test_CLIPTextModelWithProjection_tosa_MI

backends/arm/test/models/stable_diffusion/test_SD3Transformer2DModel.py:128:
    TestSD3Transformer2DModel.test_SD3Transformer2DModel_tosa_BI

backends/arm/test/models/stable_diffusion/test_SD3Transformer2DModel.py:105:
    TestSD3Transformer2DModel.test_SD3Transformer2DModel_tosa_MI

backends/arm/test/models/stable_diffusion/test_vae_AutoencoderKL.py:74:
    TestAutoencoderKL.test_AutoencoderKL_tosa_BI

backends/arm/test/models/stable_diffusion/test_vae_AutoencoderKL.py:55:
    TestAutoencoderKL.test_AutoencoderKL_tosa_MI

backends/arm/test/models/test_conformer.py:60:
    test_conformer_tosa_MI

backends/arm/test/models/test_deit_tiny_arm.py:45:
    test_deit_tiny_tosa_MI

backends/arm/test/models/test_deit_tiny_arm.py:58:
    test_deit_tiny_tosa_BI

backends/arm/test/models/test_dl3_arm.py:44:
    test_dl3_tosa_MI

backends/arm/test/models/test_dl3_arm.py:57:
    test_dl3_tosa_BI

backends/arm/test/models/test_mobilenet_v2_arm.py:45:
    test_mv2_tosa_MI

backends/arm/test/models/test_mobilenet_v2_arm.py:60:
    test_mv2_tosa_BI[per_channel_quantization=true]
    test_mv2_tosa_BI[per_channel_quantization=false]

backends/arm/test/models/test_mobilenet_v2_arm.py:78:
    test_mv2_u55_BI[per_channel_quantization=true]
    test_mv2_u55_BI[per_channel_quantization=false]

backends/arm/test/models/test_mobilenet_v2_arm.py:96:
    test_mv2_u85_BI[per_channel_quantization=true]
    test_mv2_u85_BI[per_channel_quantization=false]

"RuntimeError: Expected to not find "torch.ops.higher_order.executorch_call_delegate" but found it"

The CI log also shows multiple TOSA partitioning rejections due to _clone_dim_order not being in BaseTOSASupportList (a sketch of the likely fix is at the end of this comment).


unittest-arm-backend-with-no-fvp (test_pytest_ops) / linux-job

30 failures

backends/arm/test/ops/test_alias_copy.py:63:
    test_alias_tosa_BI[1d_ramp]
    test_alias_tosa_BI[2d_ones]
    test_alias_tosa_BI[3d_rand]
    test_alias_tosa_BI[4d_zeros]

"RuntimeError: Expected out tensor to have dtype signed char, but got float instead"
backends/arm/test/ops/test_clone.py:58:
    test_clone_tosa_MI[ones_1D_10]
    test_clone_tosa_MI[ones_1D_50]
    test_clone_tosa_MI[rand_1D_20]
    test_clone_tosa_MI[rand_2D_10x10]
    test_clone_tosa_MI[rand_3D_5x5x5]
    test_clone_tosa_MI[rand_4D_2x3x4x5]
    test_clone_tosa_MI[large_tensor]

backends/arm/test/ops/test_clone.py:69:
    test_clone_tosa_BI[ones_1D_10]
    test_clone_tosa_BI[ones_1D_50]
    test_clone_tosa_BI[rand_1D_20]
    test_clone_tosa_BI[rand_2D_10x10]
    test_clone_tosa_BI[rand_3D_5x5x5]
    test_clone_tosa_BI[rand_4D_2x3x4x5]
    test_clone_tosa_BI[large_tensor]

backends/arm/test/passes/test_remove_clone_pass.py:43:
    test_remove_clone_tosa_BI

"RuntimeError: Expected to find "torch.ops.higher_order.executorch_call_delegate" but did not find it"
backends/arm/test/ops/test_multihead_attention.py:48:
    test_multihead_attention_tosa_MI[rand_2d]
    test_multihead_attention_tosa_MI[randn_2d]
    test_multihead_attention_tosa_MI[randn_3d]

backends/arm/test/ops/test_multihead_attention.py:65:
    test_multihead_attention_tosa_BI[rand_2d]
    test_multihead_attention_tosa_BI[randn_2d]
    test_multihead_attention_tosa_BI[randn_3d]

backends/arm/test/ops/test_multihead_attention.py:108:
    test_multihead_attention_u85_BI[rand_2d]
    test_multihead_attention_u85_BI[randn_2d]
    test_multihead_attention_u85_BI[randn_3d]

backends/arm/test/ops/test_repeat.py:74:
    test_repeat_tosa_MI[interleave_int_3_x_1]

"RuntimeError: Expected to not find "torch.ops.higher_order.executorch_call_delegate" but found it"
backends/arm/test/ops/test_repeat.py:86:
    test_repeat_tosa_BI[interleave_int_3_x_1]

"RuntimeError: Expected out tensor to have dtype signed char, but got float instead"

@keyprocedure
Contributor Author

I think we should land the kernel PR before addressing any new failures from the latest CI run here, since this PR depends on the _clone_dim_order kernel existing.

@Gasoonjia
Contributor

Sure, let's land the PRs one by one.

@keyprocedure keyprocedure changed the title [EXIR] Map clone to _clone_dim_order [EXIR] Register _clone_dim_order op and map aten.clone to it Aug 6, 2025
@keyprocedure keyprocedure changed the title [EXIR] Register _clone_dim_order op and map aten.clone to it [EXIR] Register _clone_dim_order op and map aten.clone Aug 6, 2025