Commit 67ad62f

[wip] second design for metrics operator (#63)
* WIP to refactor: This is going to be a huge refactor to remove the application/storage "hard coded" legos, replaced by a more flexible setup where we have one base metric set (no subtypes). Metrics generate the replicated jobs (as many as they like, however they please), and addons are then provided to them; these can range from additional volumes to containers (that provide volumes) to any kind of customization. This is not ready for any kind of testing, but I am mostly concerned about my computer blowing up and losing the work, so I am saving for good measure :) Also, yay today! :D
* definitely making bad life decisions
* very satisfying deletion of things.
* lammps ran!
* amg is back
* bdas is back
* add back hpl: we did not get this completely working before (likely the spack mpi install, as a basic hostname does not work), so a basic conversion is sufficient
* add back kripke
* laghos
* test signing again
* add back nekbone
* add back pennant
* add back quicksilver: also simplify logic of applications, since the launcher/worker pattern is generic and can be shared
* workflow format bug
* add back fio
* add back host volume example
* add back ior
* add back osu benchmarks!
* add back chatterbug: it is accepted that this does not fully work, and we need to come back to it.
* add back netmark
* sysstat and lammps working again
* hpctoolkit design at least works, but shared libraries are failing to load. HPCToolkit you are a jerk. I am laughing. And crying. And mostly crying.
* clean up docs a little bit
* addon documentation is good
* hopefully fix bug
* fixing workingdir bug!
* update to v1alpha2
* bugfix
* a single touch marker at the end of the copy is more reliable than a file that is part of it!
* support to customize container for any metric, and for hpctoolkit to run post commands
* support for custom container
* add print at end of post analysis for hpctoolkit
* fixing bug with internal crd state: if we do not make a copy (reflect) of the interface, the state seems to change (and persist) between runs. While I am still worried about this design, this at least seems to fix that bug. I am also wondering about garbage collection (e.g., if making the copies means they stay around and the operator will use increasing memory), but that is still to be explored.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
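The last item in the commit message (state persisting between runs unless the interface is copied) can be illustrated with a minimal sketch. This is not the operator's code: `Metric`, `lammps`, `registry`, `GetShared`, and `GetCopy` are hypothetical names, and the sketch assumes the bug came from handing out one long-lived pointer stored behind an interface. Only standard `reflect` calls are used.

```go
package main

import (
	"fmt"
	"reflect"
)

// Metric is a hypothetical interface standing in for whatever the operator registers.
type Metric interface {
	SetOption(key, value string)
	Options() map[string]string
}

// lammps is a hypothetical metric whose options are mutated while a run is prepared.
type lammps struct {
	options map[string]string
}

func (m *lammps) SetOption(key, value string) {
	if m.options == nil {
		m.options = map[string]string{}
	}
	m.options[key] = value
}

func (m *lammps) Options() map[string]string { return m.options }

// registry keeps one long-lived template per metric name (hypothetical).
var registry = map[string]Metric{"app-lammps": &lammps{}}

// GetShared hands back the registered pointer itself, so any mutation
// made for one run is still there on the next lookup.
func GetShared(name string) Metric { return registry[name] }

// GetCopy allocates a new value of the template's concrete type and copies
// the template's fields into it. The copy is shallow: a non-nil map or slice
// in the template would still be shared and would need a deep copy.
func GetCopy(name string) Metric {
	template, ok := registry[name]
	if !ok {
		return nil
	}
	source := reflect.ValueOf(template).Elem() // the struct behind the pointer
	fresh := reflect.New(source.Type())        // pointer to a new zero struct
	fresh.Elem().Set(source)                   // shallow copy of all fields
	return fresh.Interface().(Metric)
}

func main() {
	// With copies, state set for the first run does not reach the second.
	first := GetCopy("app-lammps")
	first.SetOption("workdir", "/opt/run-1")
	second := GetCopy("app-lammps")
	fmt.Println("options on a fresh copy:", len(second.Options()))

	// With the shared pointer, the same sequence leaks state across runs.
	GetShared("app-lammps").SetOption("workdir", "/opt/run-1")
	fmt.Println("options on the shared value:", len(GetShared("app-lammps").Options()))
}
```

Running the sketch reports 0 options on the fresh copy and 1 on the shared value. On the garbage collection worry: in Go such copies are ordinary heap values and become collectable once nothing references them, so memory would only grow if the copies were retained somewhere (for example in a cache or a long-lived slice).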
1 parent 24980db commit 67ad62f

134 files changed: +4604 -4683 lines changed

.github/workflows/main.yaml

Lines changed: 14 additions & 14 deletions
@@ -16,7 +16,7 @@ jobs:
     - name: Check Spelling
       uses: crate-ci/typos@7ad296c72fa8265059cc03d1eda562fbdfcd6df2 # v1.9.0
       with:
-        files: ./README.md ./config/samples ./docs/*.md ./docs/*/*.md
+        files: ./README.md ./docs/*.md ./docs/*/*.md ./docs/*/*/*.md

     - name: Lint and format Python code
       run: |
@@ -66,19 +66,19 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        test: [["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # performance test
-               ["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # storage test
-               ["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], # storage test
-               ["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120], # storage test
-               # ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120], # network app test
-               ["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120], # standalone app test
-               # ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120], # standalone app test
-               ["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test
-               ["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], # standalone app test
-               ["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120], # standalone app test
-               ["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120], # standalone app test
-               ["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120], # standalone app test
-               ["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120]] # standalone app test
+        test: [["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120],
+               ["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60],
+               ["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60],
+               ["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120],
+               ["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120],
+               ## ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120],
+               ["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120],
+               ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120],
+               ["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120],
+               ["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120],
+               ["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120],
+               ["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120],
+               ["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120]]

     steps:
     - name: Clone the code

.github/workflows/python.yaml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
       run: |
         export PATH="/usr/share/miniconda/bin:$PATH"
         source activate mo
-        cd sdk/python/v1alpha1
+        cd sdk/python/v1alpha2
         pip install .
         pip install seaborn pandas

.github/workflows/release.yaml

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@ jobs:
       run: |
         export PATH="/usr/share/miniconda/bin:$PATH"
         source activate mo
-        cd sdk/python/v1alpha1/
+        cd sdk/python/v1alpha2/
         pip install -e .
         python setup.py sdist bdist_wheel
         cd dist

Makefile

Lines changed: 2 additions & 1 deletion
@@ -323,7 +323,8 @@ helm: manifests kustomize helmify

 .PHONY: docs-data
 docs-data:
-	go run hack/docs-gen/main.go docs/_static/data/metrics.json
+	go run hack/metrics-gen/main.go docs/_static/data/metrics.json
+	go run hack/addons-gen/main.go docs/_static/data/addons.json

 .PHONY: pre-push
 pre-push: generate build-config-arm build-config docs-data
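The `docs-data` target above invokes each generator with a single argument, the JSON file to write. The generators themselves are not shown in this view, so the following is only a sketch of what a program with that calling convention could look like. The `Entry` type, its JSON field names, and the `collect` and `render` helpers are all assumptions; the two sample entries simply reuse names and images from the CI matrix.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// Entry is a hypothetical record for one metric in the generated docs data.
type Entry struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

// collect returns the entries to document, sorted by name so that repeated
// runs produce identical files. The entries are sample data only.
func collect() []Entry {
	entries := []Entry{
		{Name: "io-fio", Image: "ghcr.io/converged-computing/metric-fio:latest"},
		{Name: "app-lammps", Image: "ghcr.io/converged-computing/metric-lammps:latest"},
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Name < entries[j].Name })
	return entries
}

// render serializes the entries as indented JSON.
func render(entries []Entry) ([]byte, error) {
	return json.MarshalIndent(entries, "", "  ")
}

func main() {
	// The output path is the only argument, mirroring the Makefile recipe.
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: go run main.go <output.json>")
		os.Exit(1)
	}
	data, err := render(collect())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile(os.Args[1], data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Sorting before writing keeps the generated file stable between runs, which matters when the output is committed and regenerated by a `pre-push` target.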

PROJECT

Lines changed: 2 additions & 2 deletions
@@ -17,6 +17,6 @@ resources:
   controller: true
   domain: flux-framework.org
   kind: MetricSet
-  path: github.com/converged-computing/metrics-operator/api/v1alpha1
-  version: v1alpha1
+  path: github.com/converged-computing/metrics-operator/api/v1alpha2
+  version: v1alpha2
 version: "3"

README.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ To learn more:

 ## Dinosaur TODO

+- Figure out issue with errors.IsNotFound not working...
 - We need a way for the entrypoint command to monitor (based on the container) to differ (potentially)
 - For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
 - For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application

api/v1alpha1/groupversion_info.go renamed to api/v1alpha2/groupversion_info.go

Lines changed: 3 additions & 3 deletions
@@ -14,10 +14,10 @@ See the License for the specific language governing permissions and
 limitations under the License.
 */

-// Package v1alpha1 contains API Schema definitions for the v1alpha1 API group
+// Package v1alpha2 contains API Schema definitions for the v1alpha2 API group
 // +kubebuilder:object:generate=true
 // +groupName=flux-framework.org
-package v1alpha1
+package v1alpha2

 import (
 	"k8s.io/apimachinery/pkg/runtime/schema"
@@ -26,7 +26,7 @@ import (

 var (
 	// GroupVersion is group version used to register these objects
-	GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha1"}
+	GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha2"}

 	// SchemeBuilder is used to add go types to the GroupVersionKind scheme
 	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}
