Skip to content

Commit 8333b04

Browse files
authored
Merge pull request #9 from KernelTuner/dev
Rewrite `map_impl` and extend guide
2 parents 07cedcc + 534c4cd commit 8333b04

29 files changed

+1348
-741
lines changed

docs/build_api.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -154,6 +154,7 @@ def build_index_page(groups):
154154
"all",
155155
"any",
156156
"count",
157+
"dot",
157158
],
158159
"Mathematical": [
159160
("abs", "abs(const V&)"),

docs/guides.rst

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,5 +5,7 @@ Guides
55

66
guides/introduction.rst
77
guides/promotion.rst
8-
guides/prelude.rst
8+
guides/memory.rst
9+
guides/accuracy.rst
910
guides/constant.rst
11+
guides/example.md

docs/guides/accuracy.md

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# Accuracy Level
2+
3+
For certain operations, there might be alternative versions available that provide better performance at the cost of lower accuracy.
4+
In other words, they are faster but also have a small error.
5+
6+
## Fast Math
7+
8+
For several operations in single precision (float32), there are "fast math" versions. These functions are faster since they are hardware accelerated, but are not IEEE compliant for all inputs.
9+
10+
To use this functionality, use the `fast_*` functions from Kernel Float.
11+
12+
```cpp
13+
kf::vec<float, 4> x = {1.0f, 2.0f, 3.0f, 4.0f};
14+
15+
// Sine
16+
kf::vec<float, 4> a = kf::fast_sin(x);
17+
18+
// Square root
19+
kf::vec<float, 4> b = kf::fast_sqrt(x);
20+
21+
// Reciprocal `1/x`
22+
kf::vec<float, 4> c = kf::fast_rcp(x);
23+
24+
// Division `a/b`
25+
kf::vec<float, 4> d = kf::fast_div(a, b);
26+
```
27+
28+
These functions are only functional for 32-bit and 16-bit floats.
29+
For other input types, the operation falls back to the regular version.
30+
31+
## Approximate Math
32+
33+
For 16-bit floats, several approximate functions are provided.
34+
These use approximations (typically low-degree polynomials) to calculate rough estimates of the functions.
35+
This can be very fast but also less accurate.
36+
37+
38+
To use this functionality, use the `approx_*` functions from Kernel Float. For other input types, the operation falls back to the `fast_*` variant.
39+
40+
```cpp
41+
kf::vec<half, 4> x = {1.0, 2.0, 3.0, 4.0};
42+
43+
// Sine
44+
kf::vec<half, 4> a = kf::approx_sin(x);
45+
46+
// Square root
47+
kf::vec<half, 4> b = kf::approx_sqrt(x);
48+
49+
// Reciprocal `1/x`
50+
kf::vec<half, 4> c = kf::approx_rcp(x);
51+
52+
// Division `a/b`
53+
kf::vec<half, 4> d = kf::approx_div(a, b);
54+
```
55+
56+
You can adjust the degree of approximation by supplying an integer template parameter:
57+
58+
59+
```cpp
60+
// Sine approximation with polynomial of degree 1
61+
kf::vec<half, 4> a = kf::approx_sin<1>(x);
62+
63+
// Polynomial of degree 2
64+
kf::vec<half, 4> a = kf::approx_sin<2>(x);
65+
66+
// Polynomial of degree 3
67+
kf::vec<half, 4> a = kf::approx_sin<3>(x);
68+
```
69+
70+
## Tuning Accuracy Level
71+
72+
Many functions in Kernel Float accept an additional Accuracy option as a template parameter.
73+
This allows you to tune the accuracy level without changing the function name.
74+
75+
There are four possible values for this parameter:
76+
77+
- `kf::accurate_policy`: Use the most accurate version of the function available.
78+
- `kf::fast_policy`: Use the "fast math" version.
79+
- `kf::approx_policy<N>`: Use the approximate version with degree `N`.
80+
- `kf::default_policy`: Use a global default policy (see the next section).
81+
82+
For example, consider this code:
83+
84+
```cpp
85+
kf::vec<float, 2> input = {1.0f, 2.0f};
86+
87+
// Use the default policy
88+
kf::vec<float, 2> a = kf::cos(input);
89+
90+
// Use the default policy
91+
kf::vec<float, 2> b = kf::cos<kf::default_policy>(input);
92+
93+
// Use the most accurate policy
94+
kf::vec<float, 2> c = kf::cos<kf::accurate_policy>(input);
95+
96+
// Use the fastest policy
97+
kf::vec<float, 2> d = kf::cos<kf::fast_policy>(input);
98+
99+
// Use the approximate policy
100+
kf::vec<float, 2> e = kf::cos<kf::approx_policy<3>>(input);
101+
102+
// You can use aliases to define your own policy
103+
using my_own_policy = kf::fast_policy;
104+
kf::vec<float, 2> f = kf::cos<my_own_policy>(input);
105+
```
106+
107+
## Setting `default_policy`
108+
109+
By default, `kf::default_policy` is set to `kf::accurate_policy`.
110+
111+
Set the preprocessor option `KERNEL_FLOAT_FAST_MATH=1` to change the default policy to `kf::fast_policy`.
112+
This will use fast math for all functions and data types that support it.

docs/guides/example.md

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# Full CUDA example
2+
3+
This page explains a CUDA program that estimates the value of pi using Kernel Float.
4+
5+
6+
## Overview
7+
8+
The program calculates Pi by generating random points within a unit square and counting how many fall inside the unit circle inscribed within that square. The ratio of points inside the circle to the total number of points approximates Pi/4.
9+
10+
The kernel is shown below:
11+
12+
13+
```c++
14+
namespace kf = kernel_float;
15+
16+
using float_type = float;
17+
static constexpr int VECTOR_SIZE = 4;
18+
19+
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
20+
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
21+
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
22+
23+
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
24+
kf::vec<int, VECTOR_SIZE> yi = thread_y;
25+
26+
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
27+
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
28+
29+
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
30+
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
31+
32+
int n = kf::count(dist <= float_type(1));
33+
34+
if (n > 0) atomicAdd(global_count, n);
35+
}
36+
```
37+
38+
39+
## Code Explanation
40+
41+
Let's go through the code step by step.
42+
43+
```cpp
44+
// Alias `kernel_float` as `kf`
45+
namespace kf = kernel_float;
46+
```
47+
48+
This creates an alias for `kernel_float`.
49+
50+
```cpp
51+
// Define the float type and vector size
52+
using float_type = float;
53+
static constexpr int VECTOR_SIZE = 4;
54+
```
55+
56+
Define `float_type` as an alias for `float` to make it easy to change precision if needed.
57+
The vector size is set to 4, meaning each thread will process 4 data points.
58+
59+
```cpp
60+
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
61+
```
62+
63+
The CUDA kernel. There are `nx` points along the x axis and `ny` points along the y axis.
64+
65+
```cpp
66+
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
67+
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
68+
```
69+
70+
Compute the global x- and y-index of this thread.
71+
72+
```cpp
73+
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
74+
kf::vec<int, VECTOR_SIZE> yi = thread_y;
75+
```
76+
77+
Compute the points that this thread will process.
78+
The x coordinates start at `thread_x * VECTOR_SIZE` and then the vector `[0, 1, 2, ..., VECTOR_SIZE-1]`.
79+
The y coordinates are all `thread_y`.
80+
81+
```cpp
82+
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
83+
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
84+
```
85+
86+
Divide `xi` and `yi` by `nx` and `ny` to normalize them to `[0, 1]` range.
87+
88+
```cpp
89+
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
90+
```
91+
92+
Compute the squared distance from the origin (0, 0) to each point from `xf`,`yf`.
93+
94+
```cpp
95+
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
96+
```
97+
98+
Take the element-wise square root.
99+
100+
```cpp
101+
int n = kf::count(dist <= float_type(1));
102+
```
103+
104+
Count the number of points in the unit circle (i.e., for which the distance is less than 4).
105+
The expression `dist <= 1` returns a vector of booleans and `kf::count` counts the number of `true` values.
106+
107+
```cpp
108+
atomicAdd(global_count, n);
109+
```
110+
111+
Add `n` to the `global_count` variable.
112+
This must be done using an atomic operation since multiple thread will write this variable simultaneously.

0 commit comments

Comments
 (0)