From 4c2fbd6ba89a01e23e964d16704d3ea5b9e3b8a4 Mon Sep 17 00:00:00 2001
From: Jessie Mongeon <133128541+jessiemongeon1@users.noreply.github.com>
Date: Wed, 18 Jun 2025 17:15:30 -0500
Subject: [PATCH 1/3] Update README.md

---
 rust/simd/README.md | 178 ++++----------------------------------
 1 file changed, 14 insertions(+), 164 deletions(-)

diff --git a/rust/simd/README.md b/rust/simd/README.md
index 0bc57b5f8..6724c92bb 100644
--- a/rust/simd/README.md
+++ b/rust/simd/README.md
@@ -1,4 +1,4 @@
-# WebAssembly SIMD Example
+# WebAssembly SIMD example

Unlike other blockchains, the Internet Computer supports WebAssembly SIMD
([Single Instruction, Multiple Data](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data))
@@ -9,59 +9,22 @@ This example showcases different approaches to utilizing the new SIMD instructio
The example consists of a canister named `mat_mat_mul` (matrix-matrix multiplication).

-## Prerequisites
+## Deploying from ICP Ninja

-This example requires an installation of:
+[![](https://icp.ninja/assets/open.svg)](https://icp.ninja/editor?g=https://github.com/dfinity/examples/tree/master/rust/simd)

-- [x] Install the [IC SDK](https://internetcomputer.org/docs/current/developer-docs/getting-started/install). Note: the WebAssembly SIMD support requires `dfx` version `0.20.2-beta.0` or later.
-- [x] Clone the example dapp project: `git clone https://github.com/dfinity/examples`
+## Build and deploy from the command line

-### Example 1: Floating point matrices multiplications
+### 1. [Download and install the IC SDK.](https://internetcomputer.org/docs/building-apps/getting-started/install)

-- #### Step 1: Setup project environment
+### 2. Download your project from ICP Ninja using the 'Download files' button in the upper left corner, or [clone the GitHub examples repository.](https://github.com/dfinity/examples/)

-Navigate into the folder containing the project's files and start a local instance of the replica with the command:
+### 3. Navigate into the project's directory.

-```sh
-cd examples/rust/simd
-dfx start --clean
-```
-
-```sh
-dfx start --clean
-Running dfx start for version 0.20.2-beta.0
-[...]
-Dashboard: http://localhost:63387/_/dashboard
-```
-
-- #### Step 2: Open another terminal window in the same directory
-
-```sh
-cd examples/rust/simd
-```
-
-- #### Step 3: Compile and deploy `mat_mat_mul` canister
-
-```sh
-dfx deploy
-```
+### 4. Run `dfx start --background --clean && dfx deploy` to deploy the project to your local environment.

-Example output:
-
-```sh
-% dfx deploy
-[...]
-Deployed canisters.
-URLs:
-  Backend canister via Candid interface:
-    mat_mat_mul: http://127.0.0.1/?canisterId=...
-```
-
-- #### Step 4: Compare the amount of instructions used for different matrix multiplication implementations
-
-Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
-from matrices `A` and `B` using optimized algorithm, the same algorithm with
-Rust auto-vectorization enabled, and WebAssembly SIMD instructions:
+Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
+from matrices `A` and `B` using optimized algorithm, the same algorithm with Rust auto-vectorization enabled, and WebAssembly SIMD instructions:

```sh
dfx canister call mat_mat_mul optimized_f32
@@ -69,133 +32,20 @@ dfx canister call mat_mat_mul auto_vectorized_f32
dfx canister call mat_mat_mul simd_f32
```

-Example output:
-
-```sh
-% dfx canister call mat_mat_mul optimized_f32
-(168_542_255 : nat64)
-% dfx canister call mat_mat_mul auto_vectorized_f32
-(13_697_228 : nat64)
-% dfx canister call mat_mat_mul simd_f32
-(13_697_228 : nat64)
-```
-
In this example, Rust's auto-vectorization shines in optimizing matrix
multiplication. The auto-vectorized code achieves over 10x speedup compared
to the optimized version!
Also, it's on par with the hand-crafted WebAssembly SIMD multiplication. -### Example 2: Integer matrices multiplications - -- #### Step 1: Setup project environment - -Navigate into the folder containing the project's files and start a local instance of the replica with the command: - -```sh -cd examples/rust/simd -dfx start --clean -``` - -```sh -dfx start --clean -Running dfx start for version 0.20.2-beta.0 -[...] -Dashboard: http://localhost:63387/_/dashboard -``` - -- #### Step 2: Open another terminal window in the same directory - -```sh -cd examples/rust/simd -``` - -- #### Step 3: Compile and deploy `mat_mat_mul` canister - -```sh -dfx deploy -``` - -Example output: - -```sh -% dfx deploy -[...] -Deployed canisters. -URLs: - Backend canister via Candid interface: - mat_mat_mul: http://127.0.0.1/?canisterId=... -``` - -- #### Step 4: Compare the amount of instructions used for different matrix multiplication implementations - -Call a loop performing 1K element-wise multiplications of `K x 4` packed slices -from matrices `A` and `B` using optimized algorithm and the same algorithm -with Rust auto-vectorization enabled: +Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices +from matrices `A` and `B` using optimized algorithm and the same algorithm with Rust auto-vectorization enabled: ```sh dfx canister call mat_mat_mul optimized_u32 dfx canister call mat_mat_mul auto_vectorized_u32 ``` -Example output: - -```sh -% dfx canister call mat_mat_mul optimized_u32 -(32_342_253 : nat64) -% dfx canister call mat_mat_mul auto_vectorized_u32 -(16_164_254 : nat64) -``` - -Rust auto-vectorization again demonstrates its power in this example. -The auto-vectorized version of the integer matrix multiplication achieves -more than a 2x speedup compared to the original code. - -## Further learning - -1. Have a look at the locally running dashboard. 
The URL is at the end of the `dfx start` command: `Dashboard: http://localhost/...` -2. Check out `mat_mat_mul` canister Candid user interface. The URLs are at the end of the `dfx deploy` command: `mat_mat_mul: http://127.0.0.1/?canisterId=...` - -### Canister interface - -The `mat_mat_mul` canister provide the following interface: - -- `naive_f32`/`naive_u32` — - returns the number of instructions used for a loop performing - 1K element-wise multiplications of matrices `A` and `B` - using naive algorithm. -- `optimized_f32`/`optimized_u32` — - returns the number of instructions used for a loop performing - 1K element-wise multiplications of `K x 4` packed slices - from matrices `A` and `B` using optimized algorithm. -- `auto_vectorized_f32`/`auto_vectorized_u32` — - returns the number of instructions used for a loop performing - 1K element-wise multiplications of `K x 4` packed slices - from matrices `A` and `B` using Rust loop auto-vectorization. -- `simd_f32` — - Returns the number of instructions used for a loop performing - 1K element-wise multiplications of `K x 4` packed slices - from matrices `A` and `B` using WebAssembly SIMD instructions. - -Example usage: - -```sh -dfx canister call mat_mat_mul naive_f32 -``` - -## Conclusion - -WebAssembly SIMD instructions unlock new possibilities for the Internet Computer, -particularly in Machine Learning and Artificial Intelligence dApps. This example -demonstrates potential 10x speedups for matrix multiplication with minimal effort -using just Rust's loop auto-vectorization. - -As shown in Example 2, integer operations also benefit, although with a more modest -"2x" speedup. - -The actual speedups will vary depending on the specific application and the type -of operations involved. +Rust auto-vectorization again demonstrates its power in this example. The auto-vectorized version of the integer matrix multiplication achieves more than a 2x speedup compared to the original code. 
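The `K x 4` packed multiply-accumulate pattern these benchmarks compare can be sketched in portable Rust. This is an illustrative sketch only — `mul_packed_4` and its signature are hypothetical names, not the canister's actual API — but it shows why the fixed inner width of 4 gives the compiler a loop it can auto-vectorize:

```rust
/// Multiply-accumulate one row of `A` against a `K x 4` packed slice of `B`.
/// Each `b_packed` entry holds the 4 columns of one row of `B`. Because the
/// inner loop has a fixed width of 4, LLVM can auto-vectorize it (e.g. into
/// WebAssembly SIMD instructions when compiled for wasm32 with SIMD enabled).
fn mul_packed_4(a_row: &[f32], b_packed: &[[f32; 4]], acc: &mut [f32; 4]) {
    for (a, b) in a_row.iter().zip(b_packed.iter()) {
        for j in 0..4 {
            acc[j] += a * b[j];
        }
    }
}

fn main() {
    // One 3-element row of `A` against a 3 x 4 slice of `B`, packed row by row.
    let a_row = [1.0_f32, 2.0, 3.0];
    let b_packed = [[1.0_f32; 4], [1.0; 4], [1.0; 4]];
    let mut acc = [0.0_f32; 4];
    mul_packed_4(&a_row, &b_packed, &mut acc);
    println!("{acc:?}"); // [6.0, 6.0, 6.0, 6.0]
}
```

Written this way, the auto-vectorized build can match hand-written SIMD intrinsics, which is consistent with the measurements above where `auto_vectorized_f32` and `simd_f32` report the same instruction count.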
## Security considerations and best practices

-If you base your application on this example, we recommend you familiarize
-yourself with and adhere to the [security best practices](https://internetcomputer.org/docs/current/references/security/)
-for developing on the Internet Computer. This example may not implement all the best practices.
+If you base your application on this example, it is recommended that you familiarize yourself with and adhere to the [security best practices](https://internetcomputer.org/docs/building-apps/security/overview) for developing on ICP. This example may not implement all the best practices.

From 7a0665fbb5344f479ecb78397fc6c4b9718eaf0f Mon Sep 17 00:00:00 2001
From: Jessie Mongeon <133128541+jessiemongeon1@users.noreply.github.com>
Date: Fri, 20 Jun 2025 09:12:21 -0500
Subject: [PATCH 2/3] Update README.md

---
 rust/simd/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/rust/simd/README.md b/rust/simd/README.md
index 6724c92bb..9252dd29f 100644
--- a/rust/simd/README.md
+++ b/rust/simd/README.md
@@ -21,7 +21,11 @@ The example consists of a canister named `mat_mat_mul` (matrix-matrix multiplica

### 3. Navigate into the project's directory.

-### 4. Run `dfx start --background --clean && dfx deploy` to deploy the project to your local environment.
+### 4. Deploy the project to your local environment:
+
+```sh
+dfx start --background --clean && dfx deploy
+```

Compare the amount of instructions used for different matrix multiplication implementations. 
Call a loop performing 1K element-wise multiplications of `K x 4` packed slices from matrices `A` and `B` using optimized algorithm, the same algorithm with Rust auto-vectorization enabled, and WebAssembly SIMD instructions:

From a12277fc83cad309992d4823c98ee4896197e96a Mon Sep 17 00:00:00 2001
From: Jessie Mongeon <133128541+jessiemongeon1@users.noreply.github.com>
Date: Tue, 24 Jun 2025 10:12:34 -0500
Subject: [PATCH 3/3] Update README.md

---
 rust/simd/README.md | 62 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/rust/simd/README.md b/rust/simd/README.md
index 9252dd29f..11fab57b9 100644
--- a/rust/simd/README.md
+++ b/rust/simd/README.md
@@ -27,8 +27,34 @@ The example consists of a canister named `mat_mat_mul` (matrix-matrix multiplica
dfx start --background --clean && dfx deploy
```

-Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
-from matrices `A` and `B` using optimized algorithm, the same algorithm with Rust auto-vectorization enabled, and WebAssembly SIMD instructions:
+### 5. Open another terminal window in the same directory.
+
+```sh
+cd examples/rust/simd
+```
+
+### 6. Compile and deploy the `mat_mat_mul` canister.
+
+```sh
+dfx deploy
+```
+
+Example output:
+
+```sh
+% dfx deploy
+[...]
+Deployed canisters.
+URLs:
+  Backend canister via Candid interface:
+    mat_mat_mul: http://127.0.0.1/?canisterId=...
+```
+
+### 7. Compare the number of instructions used for different floating-point matrix multiplication implementations.
+
+Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
+from matrices `A` and `B` using an optimized algorithm, the same algorithm with
+Rust auto-vectorization enabled, and WebAssembly SIMD instructions:

```sh
dfx canister call mat_mat_mul optimized_f32
@@ -36,19 +62,45 @@ dfx canister call mat_mat_mul auto_vectorized_f32
dfx canister call mat_mat_mul simd_f32
```

+Example output:
+
+```sh
+% dfx canister call mat_mat_mul optimized_f32
+(168_542_255 : nat64)
+% dfx canister call mat_mat_mul auto_vectorized_f32
+(13_697_228 : nat64)
+% dfx canister call mat_mat_mul simd_f32
+(13_697_228 : nat64)
+```
+
In this example, Rust's auto-vectorization shines in optimizing matrix
multiplication. The auto-vectorized code achieves over 10x speedup compared
to the optimized version!
Also, it's on par with the hand-crafted WebAssembly SIMD multiplication.

-Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
-from matrices `A` and `B` using optimized algorithm and the same algorithm with Rust auto-vectorization enabled:
+
+### 8. Compare the number of instructions used for different integer matrix multiplication implementations.
+
+Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
+from matrices `A` and `B` using an optimized algorithm and the same algorithm
+with Rust auto-vectorization enabled:

```sh
dfx canister call mat_mat_mul optimized_u32
dfx canister call mat_mat_mul auto_vectorized_u32
```
+Example output: + +```sh +% dfx canister call mat_mat_mul optimized_u32 +(32_342_253 : nat64) +% dfx canister call mat_mat_mul auto_vectorized_u32 +(16_164_254 : nat64) +``` + +Rust auto-vectorization again demonstrates its power in this example. +The auto-vectorized version of the integer matrix multiplication achieves +more than a 2x speedup compared to the original code. ## Security considerations and best practices