Your runs must in any case share one common denominator, which may have at most two values. For instance:
- Different repositories and branches but run on one machine
- Many different machines and branches, but only two different repositories
## Statistical significance
### Comparing runs with differentiating features
When running a comparison between different commits, different machines etc., GMT will also compute a *T-test* for the two samples.

It calculates the *T-test* for the means of two independent samples without assuming equal variances. (Some might know this test as *Welch's t-test* or the *Welch test*.)
If the *p-value* is lower than **0.05**, GMT will show the result as significant.
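As a minimal sketch of what this computation looks like, here is a hypothetical example using SciPy. The sample values and variable names are made up; this is not GMT's actual implementation:

```python
from scipy import stats

# Hypothetical energy samples (in Joules) from two sets of runs,
# e.g. commit A vs. commit B
runs_a = [108.4, 110.1, 109.7, 111.0, 108.9]
runs_b = [112.3, 113.0, 111.8, 112.9, 113.4]

# Welch's t-test: two independent samples, equal variances NOT assumed
t_stat, p_value = stats.ttest_ind(runs_a, runs_b, equal_var=False)

print(f"p-value: {p_value:.4f}")
print("significant" if p_value < 0.05 else "not significant")
```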
GMT will provide the *p-value* directly in the API output of the comparison.
In the frontend it is shown with a green / red significance indicator, green meaning significant.
Alternatively it will tell you that a comparison could not be made, in case there were too many missing values or the metric was not present in all runs.
### Comparing repeated runs without differentiating features

When running a comparison of repeated runs with no differentiating criteria like different commits, repos etc., GMT will run a *1-sample T-test*.
It effectively answers the question: "Did the last run in the set of repeated runs vary significantly from the ones before?"
This question is very typical: you measure a set of runs once, then come back to your code later and re-measure. The value is now different and you want to tell whether it is *significantly different*.
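A hypothetical sketch of such a check with SciPy (again, not GMT's actual code) tests whether the mean of the earlier runs differs from the newly measured value:

```python
from scipy import stats

# Hypothetical energy values (in Joules): earlier repeated runs
# and a new re-measurement
previous_runs = [108.4, 110.1, 109.7, 111.0, 108.9]
latest_run = 114.2

# 1-sample t-test: does the mean of the earlier runs differ
# significantly from the latest value?
t_stat, p_value = stats.ttest_1samp(previous_runs, popmean=latest_run)

print(f"p-value: {p_value:.4f}")
```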
If the *p-value* is lower than **0.05**, GMT will show the result as significant.

GMT will provide the *p-value* directly in the API output of the comparison.
In the frontend it is shown with a green / red significance indicator, green meaning significant.
Alternatively it will tell you that a comparison could not be made, in case there were too many missing values or the metric was not present in all runs.
## Per-container energy estimation

GMT is container native when it comes to orchestrating the application and capturing performance metrics.
But typically the energy values from the [Metric Providers]({{< relref "/docs/measuring/metric-providers/" >}}) are captured at the system level.
If you want to drill down into the energy on a per-container level, GMT offers to create an estimation based on the CPU utilization of the system.
### Setting up container estimations
Prerequisites
- You must have a PSU [Metric Provider]({{< relref "/docs/measuring/metric-providers/" >}}) activated
- You must have a System level CPU Utilization [Metric Provider]({{< relref "/docs/measuring/metric-providers/" >}}) activated
- You must have a Container level CPU Utilization [Metric Provider]({{< relref "/docs/measuring/metric-providers/" >}}) activated
GMT will then:

- Take the baseline energy value of the machine
- Take the runtime energy value of the machine
- Compute the difference
- Split the resulting difference proportionally to each individual container's CPU% in relation to the other containers' CPU% share (see the numeric sketch below)
Example:
<img class="ui centered rounded bordered" src="/img/measuring/container_power_attribution.webp" alt="Container Power Attribution">
The difference between *Container Power* and *Container Power (+Baseline Share)* is that *Container Power* is only the overhead: the additional power on top of what the system was already drawing during the baseline, when no containers were launched.

*Container Power (+Baseline Share)* also includes the attributional share of the baseline load. It is likewise split according to the container's CPU% in relation to the machine's CPU% share.
This means that a container that has 20% CPU utilization in comparison to the other containers will also get 20% of the baseline power draw attributed.
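To make the arithmetic concrete, here is a numeric sketch with hypothetical values (60 J baseline, 100 J runtime, two containers). It only illustrates the attribution described above and is not GMT's actual code:

```python
baseline_energy = 60.0   # J: machine energy during the idle baseline
runtime_energy = 100.0   # J: machine energy while the containers ran
diff = runtime_energy - baseline_energy  # 40 J attributable to the workload

# Each container's CPU% share relative to all containers (hypothetical)
cpu_share = {"container_a": 0.20, "container_b": 0.80}

for name, share in cpu_share.items():
    container_power = diff * share                              # "Container Power"
    with_baseline = container_power + baseline_energy * share  # "+ Baseline Share"
    print(f"{name}: {container_power:.1f} J / {with_baseline:.1f} J incl. baseline share")
```

With these numbers, `container_a` gets 8 J (20 J including its baseline share) and `container_b` gets 32 J (80 J including its baseline share); the attributed values sum back to the measured machine energy.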
#### Requirements for reproducibility
This method only works if the baseline is long enough (the cluster ensures this with a 60-second timeframe) and the CPU is set to a fixed frequency with Hyper-Threading and Turbo Boost disabled. Otherwise you will get incorrect allocations, as, for example, "30% CPU utilization" no longer has a clear meaning when the available cycle count is not fixed.
Under these controlled conditions, however, it is a good approximation.
#### Limitations
The value is still an approximation: although utilization is more stable with a controlled cluster setup, it is not as accurate as, for example, counting CPU instructions (which would require PMU sampling), and non-CPU energy is only considered indirectly.
So if you execute unusual CPU instructions such as AVX, encounter CPU steal time, or have a hard drive that executes asynchronous workloads like TRIM independently of CPU instructions, this will distort the energy attribution.
#### Addon: Detailed CPU energy drill-down
We currently have a beta feature, to be launched in Summer 2025, that utilizes AMD RAPL per-core energy registers and core-pinning to report individual container CPU energy metrics.
GMT will use `taskset` to pin the container to a distinct core. Since no other processes are running on your benchmarking systems in the [Green Metrics Tool Cluster Hosted Service →]({{< relref "/docs/measuring/measuring-service/" >}}), the values are very reliable.
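For illustration only, here is a minimal sketch of what reading such a per-core energy counter could look like on an AMD Zen CPU. The register addresses come from AMD's documentation, but this is not GMT's implementation; it requires the `msr` kernel module and root privileges:

```python
import struct

AMD_ENERGY_UNIT_MSR = 0xC0010299  # RAPL Power Unit register
AMD_CORE_ENERGY_MSR = 0xC001029A  # Core Energy Status register

def read_msr(core: int, reg: int) -> int:
    # Read a 64-bit model-specific register of the given core
    with open(f"/dev/cpu/{core}/msr", "rb") as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

core = 3  # the core the container was pinned to, e.g. via taskset
esu = (read_msr(core, AMD_ENERGY_UNIT_MSR) >> 8) & 0x1F  # energy status unit
joules_per_tick = 0.5 ** esu
counter = read_msr(core, AMD_CORE_ENERGY_MSR) & 0xFFFFFFFF
print(f"Core {core} energy counter: {counter * joules_per_tick:.3f} J")
```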
The private beta opens in Summer 2025. If you are interested, shoot us an email at [info@green-coding.io](mailto:info@green-coding.io).
## Measuring web pages

When GMT measures a web page, two components are captured:

- The energy of the browser needed to display and render the page
- The network transfer energy needed to download the HTML and the page assets
To isolate these as well as possible, GMT orchestrates a reverse proxy, warms up the cache by pre-loading the full page once, and only then does the final measurement.
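Conceptually the pattern is a warm-up load followed by the measured load, as in this hypothetical sketch (the URL and the orchestration hooks are assumptions, not GMT's actual code):

```python
import requests

URL = "http://localhost:8080/"  # hypothetical reverse-proxy endpoint

requests.get(URL)  # warm-up: fills the proxy cache, not measured

# ... start metric providers here ...
response = requests.get(URL)  # the actual measured page load
# ... stop metric providers and collect energy values ...

print(f"Measured load transferred {len(response.content)} bytes")
```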