From 4c2fbd6ba89a01e23e964d16704d3ea5b9e3b8a4 Mon Sep 17 00:00:00 2001
From: Jessie Mongeon <133128541+jessiemongeon1@users.noreply.github.com>
Date: Wed, 18 Jun 2025 17:15:30 -0500
Subject: [PATCH 1/3] Update README.md

---
 rust/simd/README.md | 178 ++++----------------------------------
 1 file changed, 14 insertions(+), 164 deletions(-)

diff --git a/rust/simd/README.md b/rust/simd/README.md
index 0bc57b5f8..6724c92bb 100644
--- a/rust/simd/README.md
+++ b/rust/simd/README.md
@@ -1,4 +1,4 @@
-# WebAssembly SIMD Example
+# WebAssembly SIMD example

Unlike other blockchains, the Internet Computer supports WebAssembly SIMD
([Single Instruction, Multiple Data](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data))
@@ -9,59 +9,22 @@ This example showcases different approaches to utilizing the new SIMD instructio
The example consists of a canister named `mat_mat_mul` (matrix-matrix multiplication).

-## Prerequisites
+## Deploying from ICP Ninja

-This example requires an installation of:
+[![](https://icp.ninja/assets/open.svg)](https://icp.ninja/editor?g=https://github.com/dfinity/examples/tree/master/rust/simd)

-- [x] Install the [IC SDK](https://internetcomputer.org/docs/current/developer-docs/getting-started/install). Note: the WebAssembly SIMD support requires `dfx` version `0.20.2-beta.0` or later.
-- [x] Clone the example dapp project: `git clone https://github.com/dfinity/examples`
+## Build and deploy from the command line

-### Example 1: Floating point matrices multiplications
+### 1. [Download and install the IC SDK.](https://internetcomputer.org/docs/building-apps/getting-started/install)

-- #### Step 1: Setup project environment
+### 2. Download your project from ICP Ninja using the 'Download files' button in the upper left corner, or [clone the GitHub examples repository.](https://github.com/dfinity/examples/)

-Navigate into the folder containing the project's files and start a local instance of the replica with the command:
+### 3. Navigate into the project's directory.

-```sh
-cd examples/rust/simd
-dfx start --clean
-```
-
-```sh
-dfx start --clean
-Running dfx start for version 0.20.2-beta.0
-[...]
-Dashboard: http://localhost:63387/_/dashboard
-```
-
-- #### Step 2: Open another terminal window in the same directory
-
-```sh
-cd examples/rust/simd
-```
-
-- #### Step 3: Compile and deploy `mat_mat_mul` canister
-
-```sh
-dfx deploy
-```
+### 4. Run `dfx start --background --clean && dfx deploy` to deploy the project to your local environment.

-Example output:
-
-```sh
-% dfx deploy
-[...]
-Deployed canisters.
-URLs:
-  Backend canister via Candid interface:
-    mat_mat_mul: http://127.0.0.1/?canisterId=...
-```
-
-- #### Step 4: Compare the amount of instructions used for different matrix multiplication implementations
-
-Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
-from matrices `A` and `B` using optimized algorithm, the same algorithm with
-Rust auto-vectorization enabled, and WebAssembly SIMD instructions:
+Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
+from matrices `A` and `B` using optimized algorithm, the same algorithm with Rust auto-vectorization enabled, and WebAssembly SIMD instructions:

```sh
dfx canister call mat_mat_mul optimized_f32
@@ -69,133 +32,20 @@ dfx canister call mat_mat_mul auto_vectorized_f32
dfx canister call mat_mat_mul simd_f32
```

-Example output:
-
-```sh
-% dfx canister call mat_mat_mul optimized_f32
-(168_542_255 : nat64)
-% dfx canister call mat_mat_mul auto_vectorized_f32
-(13_697_228 : nat64)
-% dfx canister call mat_mat_mul simd_f32
-(13_697_228 : nat64)
-```
-
In this example, Rust's auto-vectorization shines in optimizing matrix
multiplication. The auto-vectorized code achieves over 10x speedup compared
to the optimized version!
Also, it's on par with the hand-crafted WebAssembly SIMD multiplication. -### Example 2: Integer matrices multiplications - -- #### Step 1: Setup project environment - -Navigate into the folder containing the project's files and start a local instance of the replica with the command: - -```sh -cd examples/rust/simd -dfx start --clean -``` - -```sh -dfx start --clean -Running dfx start for version 0.20.2-beta.0 -[...] -Dashboard: http://localhost:63387/_/dashboard -``` - -- #### Step 2: Open another terminal window in the same directory - -```sh -cd examples/rust/simd -``` - -- #### Step 3: Compile and deploy `mat_mat_mul` canister - -```sh -dfx deploy -``` - -Example output: - -```sh -% dfx deploy -[...] -Deployed canisters. -URLs: - Backend canister via Candid interface: - mat_mat_mul: http://127.0.0.1/?canisterId=... -``` - -- #### Step 4: Compare the amount of instructions used for different matrix multiplication implementations - -Call a loop performing 1K element-wise multiplications of `K x 4` packed slices -from matrices `A` and `B` using optimized algorithm and the same algorithm -with Rust auto-vectorization enabled: +Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices +from matrices `A` and `B` using optimized algorithm and the same algorithm with Rust auto-vectorization enabled: ```sh dfx canister call mat_mat_mul optimized_u32 dfx canister call mat_mat_mul auto_vectorized_u32 ``` -Example output: - -```sh -% dfx canister call mat_mat_mul optimized_u32 -(32_342_253 : nat64) -% dfx canister call mat_mat_mul auto_vectorized_u32 -(16_164_254 : nat64) -``` - -Rust auto-vectorization again demonstrates its power in this example. -The auto-vectorized version of the integer matrix multiplication achieves -more than a 2x speedup compared to the original code. - -## Further learning - -1. Have a look at the locally running dashboard. 
The URL is at the end of the `dfx start` command: `Dashboard: http://localhost/...` -2. Check out `mat_mat_mul` canister Candid user interface. The URLs are at the end of the `dfx deploy` command: `mat_mat_mul: http://127.0.0.1/?canisterId=...` - -### Canister interface - -The `mat_mat_mul` canister provide the following interface: - -- `naive_f32`/`naive_u32` — - returns the number of instructions used for a loop performing - 1K element-wise multiplications of matrices `A` and `B` - using naive algorithm. -- `optimized_f32`/`optimized_u32` — - returns the number of instructions used for a loop performing - 1K element-wise multiplications of `K x 4` packed slices - from matrices `A` and `B` using optimized algorithm. -- `auto_vectorized_f32`/`auto_vectorized_u32` — - returns the number of instructions used for a loop performing - 1K element-wise multiplications of `K x 4` packed slices - from matrices `A` and `B` using Rust loop auto-vectorization. -- `simd_f32` — - Returns the number of instructions used for a loop performing - 1K element-wise multiplications of `K x 4` packed slices - from matrices `A` and `B` using WebAssembly SIMD instructions. - -Example usage: - -```sh -dfx canister call mat_mat_mul naive_f32 -``` - -## Conclusion - -WebAssembly SIMD instructions unlock new possibilities for the Internet Computer, -particularly in Machine Learning and Artificial Intelligence dApps. This example -demonstrates potential 10x speedups for matrix multiplication with minimal effort -using just Rust's loop auto-vectorization. - -As shown in Example 2, integer operations also benefit, although with a more modest -"2x" speedup. - -The actual speedups will vary depending on the specific application and the type -of operations involved. +Rust auto-vectorization again demonstrates its power in this example. The auto-vectorized version of the integer matrix multiplication achieves more than a 2x speedup compared to the original code. 
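The `K x 4` packed multiply-accumulate pattern these benchmarks compare can be sketched in portable Rust. This is an illustrative sketch only — `mul_packed_4` and its signature are hypothetical names, not the canister's actual API — but it shows why the fixed inner width of 4 gives the compiler a loop it can auto-vectorize:

```rust
/// Multiply-accumulate one row of `A` against a `K x 4` packed slice of `B`.
/// Each `b_packed` entry holds the 4 columns of one row of `B`. Because the
/// inner loop has a fixed width of 4, LLVM can auto-vectorize it (e.g. into
/// WebAssembly SIMD instructions when compiled for wasm32 with SIMD enabled).
fn mul_packed_4(a_row: &[f32], b_packed: &[[f32; 4]], acc: &mut [f32; 4]) {
    for (a, b) in a_row.iter().zip(b_packed.iter()) {
        for j in 0..4 {
            acc[j] += a * b[j];
        }
    }
}

fn main() {
    // One 3-element row of `A` against a 3 x 4 slice of `B`, packed row by row.
    let a_row = [1.0_f32, 2.0, 3.0];
    let b_packed = [[1.0_f32; 4], [1.0; 4], [1.0; 4]];
    let mut acc = [0.0_f32; 4];
    mul_packed_4(&a_row, &b_packed, &mut acc);
    println!("{acc:?}"); // [6.0, 6.0, 6.0, 6.0]
}
```

Written this way, the auto-vectorized build can match hand-written SIMD intrinsics, which is consistent with the measurements above where `auto_vectorized_f32` and `simd_f32` report the same instruction count.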
## Security considerations and best practices

-If you base your application on this example, we recommend you familiarize
-yourself with and adhere to the [security best practices](https://internetcomputer.org/docs/current/references/security/)
-for developing on the Internet Computer. This example may not implement all the best practices.
+If you base your application on this example, it is recommended that you familiarize yourself with and adhere to the [security best practices](https://internetcomputer.org/docs/building-apps/security/overview) for developing on ICP. This example may not implement all the best practices.

From 7a0665fbb5344f479ecb78397fc6c4b9718eaf0f Mon Sep 17 00:00:00 2001
From: Jessie Mongeon <133128541+jessiemongeon1@users.noreply.github.com>
Date: Fri, 20 Jun 2025 09:12:21 -0500
Subject: [PATCH 2/3] Update README.md

---
 rust/simd/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/rust/simd/README.md b/rust/simd/README.md
index 6724c92bb..9252dd29f 100644
--- a/rust/simd/README.md
+++ b/rust/simd/README.md
@@ -21,7 +21,11 @@ The example consists of a canister named `mat_mat_mul` (matrix-matrix multiplica

### 3. Navigate into the project's directory.

-### 4. Run `dfx start --background --clean && dfx deploy` to deploy the project to your local environment.
+### 4. Deploy the project to your local environment:
+
+```sh
+dfx start --background --clean && dfx deploy
+```

Compare the amount of instructions used for different matrix multiplication implementations. 
Call a loop performing 1K element-wise multiplications of `K x 4` packed slices from matrices `A` and `B` using optimized algorithm, the same algorithm with Rust auto-vectorization enabled, and WebAssembly SIMD instructions:

From a12277fc83cad309992d4823c98ee4896197e96a Mon Sep 17 00:00:00 2001
From: Jessie Mongeon <133128541+jessiemongeon1@users.noreply.github.com>
Date: Tue, 24 Jun 2025 10:12:34 -0500
Subject: [PATCH 3/3] Update README.md

---
 rust/simd/README.md | 62 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/rust/simd/README.md b/rust/simd/README.md
index 9252dd29f..11fab57b9 100644
--- a/rust/simd/README.md
+++ b/rust/simd/README.md
@@ -27,8 +27,34 @@ The example consists of a canister named `mat_mat_mul` (matrix-matrix multiplica
dfx start --background --clean && dfx deploy
```

-Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
-from matrices `A` and `B` using optimized algorithm, the same algorithm with Rust auto-vectorization enabled, and WebAssembly SIMD instructions:
+### 5. Open another terminal window in the same directory.
+
+```sh
+cd examples/rust/simd
+```
+
+### 6. Compile and deploy the `mat_mat_mul` canister.
+
+```sh
+dfx deploy
+```
+
+Example output:
+
+```sh
+% dfx deploy
+[...]
+Deployed canisters.
+URLs:
+  Backend canister via Candid interface:
+    mat_mat_mul: http://127.0.0.1/?canisterId=...
+```
+
+### 7. Compare the number of instructions used for different floating-point matrix multiplication implementations.
+
+Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
+from matrices `A` and `B` using an optimized algorithm, the same algorithm with
+Rust auto-vectorization enabled, and WebAssembly SIMD instructions:

```sh
dfx canister call mat_mat_mul optimized_f32
@@ -36,19 +62,45 @@ dfx canister call mat_mat_mul auto_vectorized_f32
dfx canister call mat_mat_mul simd_f32
```

+Example output:
+
+```sh
+% dfx canister call mat_mat_mul optimized_f32
+(168_542_255 : nat64)
+% dfx canister call mat_mat_mul auto_vectorized_f32
+(13_697_228 : nat64)
+% dfx canister call mat_mat_mul simd_f32
+(13_697_228 : nat64)
+```
+
In this example, Rust's auto-vectorization shines in optimizing matrix
multiplication. The auto-vectorized code achieves over 10x speedup compared
to the optimized version!
Also, it's on par with the hand-crafted WebAssembly SIMD multiplication.

-Compare the amount of instructions used for different matrix multiplication implementations. Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
-from matrices `A` and `B` using optimized algorithm and the same algorithm with Rust auto-vectorization enabled:
+
+### 8. Compare the number of instructions used for different integer matrix multiplication implementations.
+
+Call a loop performing 1K element-wise multiplications of `K x 4` packed slices
+from matrices `A` and `B` using an optimized algorithm and the same algorithm
+with Rust auto-vectorization enabled:

```sh
dfx canister call mat_mat_mul optimized_u32
dfx canister call mat_mat_mul auto_vectorized_u32
```
+Example output: + +```sh +% dfx canister call mat_mat_mul optimized_u32 +(32_342_253 : nat64) +% dfx canister call mat_mat_mul auto_vectorized_u32 +(16_164_254 : nat64) +``` + +Rust auto-vectorization again demonstrates its power in this example. +The auto-vectorized version of the integer matrix multiplication achieves +more than a 2x speedup compared to the original code. ## Security considerations and best practices