Commit bd91b25

add ml example (#76)
* add ml example Signed-off-by: vsoch <vsoch@users.noreply.github.com>
1 parent 1d376ab commit bd91b25

3 files changed: +109 -0 lines changed

examples/tests/dlio-ml/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# DLIO ML Example

This is an example of using the I/O tool [DLIO](https://dlio-profiler.readthedocs.io/en/latest/build.html#build-dlio-profiler-with-pip-recommended), which can
be added on the fly with pip.

## Usage

Create a cluster and install JobSet to it.

```bash
kind create cluster
VERSION=v0.2.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
```

Install the operator (from the development manifest here):

```bash
kubectl apply -f ../../dist/metrics-operator-dev.yaml
```

To see the metrics operator logs (your pod name will have a different suffix):

```bash
kubectl logs -n metrics-system metrics-controller-manager-859c66464c-7rpbw
```

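Because the pod suffix is generated, it can be easier to ask for the logs through the Deployment. This is a minimal sketch, assuming the Deployment is named `metrics-controller-manager` (inferred from the pod name prefix above, not confirmed against the manifest). The helper name `operator_logs` is hypothetical.

```bash
# Hypothetical helper: print operator logs without knowing the pod suffix.
# Assumes a Deployment called "metrics-controller-manager" in the
# "metrics-system" namespace (inferred from the pod name in this README).
operator_logs() {
  # Extra arguments (for example --tail=20 or -f) are passed through to kubectl.
  kubectl logs -n metrics-system deployment/metrics-controller-manager "$@"
}
```

For example, `operator_logs --tail=20` would show only the last 20 lines.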
Then create the metrics set. This is going to run a single run of the DLIO benchmark (the `resnet50` workload in `metrics.yaml`) over MPI.

```bash
kubectl apply -f metrics.yaml
```

Wait until you see pods created by the job and then running (the sample output below shows two, a launcher and a worker):

```bash
kubectl get pods
```

```diff
NAME                           READY   STATUS    RESTARTS   AGE
metricset-sample-l-0-0-lt782   1/1     Running   0          3s
metricset-sample-w-0-0-4s5p9   1/1     Running   0          3s
```

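Instead of re-running `kubectl get pods` by hand, the wait can be scripted. The following is a rough sketch, not part of the operator: `wait_for_running` and the `POLL_SECONDS` variable are hypothetical names, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...).

```bash
# Hypothetical helper: poll until every pod whose name starts with the
# given prefix reports STATUS "Running", or give up after N attempts.
wait_for_running() {
  prefix="$1"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Column 1 is NAME and column 3 is STATUS in the default output.
    statuses="$(kubectl get pods --no-headers 2>/dev/null \
      | awk -v p="$prefix" 'index($1, p) == 1 { print $3 }')"
    # Succeed when at least one pod matched and none is in another state.
    if [ -n "$statuses" ] && ! printf '%s\n' "$statuses" | grep -qv '^Running$'; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-2}"
  done
  return 1
}
```

A call such as `wait_for_running metricset-sample && kubectl get pods` would then block until the pods are up.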
In the above, "l" is a launcher pod, and "w" is a worker pod.
If you inspect the log for the launcher you'll see a short sleep (the network isn't up immediately)
and then the benchmark running, with its log printed to the console.

```bash
kubectl logs metricset-sample-l-0-0-lt782 -f
```

There is purposefully a `sleep infinity` at the end to give you a chance to copy over data.
Substitute your own pod name in the commands below.

```bash
mkdir -p ./data ./output
# Only if you are interested in the ML data
kubectl cp metricset-sample-m-0-0-xfg6r:/dlio/data ./data/
kubectl cp metricset-sample-m-0-0-xfg6r:/dlio/output ./output
```

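The copy step can also look up the pod name rather than hard-coding it. This is a sketch under assumptions: `first_pod` and `copy_dlio_results` are hypothetical helper names, and the pod is assumed to be the first one whose name starts with the given prefix.

```bash
# Hypothetical helper: print the name of the first pod starting with a prefix.
first_pod() {
  kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
    | awk -v p="$1" 'index($1, p) == 1 { print $1; exit }'
}

# Hypothetical helper: copy the DLIO data and output directories locally.
copy_dlio_results() {
  pod="$(first_pod "${1:-metricset-sample}")"
  [ -n "$pod" ] || { echo "no matching pod found" >&2; return 1; }
  mkdir -p ./data ./output
  # Same source and destination paths as the commands in this README.
  kubectl cp "$pod:/dlio/data" ./data/
  kubectl cp "$pod:/dlio/output" ./output
}
```

Passing a narrower prefix, for example `copy_dlio_results metricset-sample-m`, avoids matching an unrelated pod.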
You can open the tiny trace file (`.trace*.pfw` in the output directory) in [https://ui.perfetto.dev/](https://ui.perfetto.dev/).

![img/trace.png](img/trace.png)

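The trace files start with a dot, so a plain `ls ./output` will not show them. A small sketch for finding them after the copy follows; `list_traces` is a hypothetical helper, and the recursive search is used because the exact directory layout produced by `kubectl cp` is not assumed here.

```bash
# Hypothetical helper: list decompressed DLIO trace files under a directory.
# The pattern mirrors the ".trace*.pfw" glob used in metrics.yaml.
list_traces() {
  find "${1:-./output}" -type f -name '.trace*.pfw'
}
```

Running `list_traces ./output` prints the paths you can then load into the Perfetto UI.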
Other applications of interest might be related to AI/ML - we will try more soon!

Clean up when you are done:

```bash
kubectl delete -f metrics.yaml
```

examples/tests/dlio-ml/img/trace.png

52.5 KB

examples/tests/dlio-ml/metrics.yaml

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # kubectl apply -f metrics.yaml
  # kubectl logs <launcher-pod> -f
  pods: 1

  metrics:
    - name: io-ior
      options:
        command: mpirun --allow-run-as-root -np 10 dlio_benchmark workload=resnet50 ++workload.dataset.data_folder=/dlio/data ++workload.output.folder=/dlio/output
        workdir: /dlio/data
      addons:
        - name: commands
          options:
            preBlock: |
              apt-get update && apt-get install -y python3 python3-pip openmpi-bin openmpi-common libopenmpi-dev hwloc libhwloc-dev default-jre
              #python3 -m pip install git+https://github.com/hariharan-devarajan/dlio-profiler.git
              #python3 -m pip install git+https://github.com/argonne-lcf/dlio_benchmark.git
              python3 -m pip install "dlio_benchmark[dlio_profiler] @ git+https://github.com/argonne-lcf/dlio_benchmark.git"
              mkdir -p /dlio/data /dlio/output /dlio/logs
              export DLIO_PROFILER_ENABLE=0
              mpirun -np 10 --allow-run-as-root dlio_benchmark workload=resnet50 ++workload.dataset.data_folder=/dlio/data ++workload.output.folder=/dlio/output ++workload.workflow.generate_data=True ++workload.workflow.train=False
              export DLIO_PROFILER_LOG_LEVEL=ERROR
              export DLIO_PROFILER_ENABLE=1
              export DLIO_PROFILER_INC_METADATA=1
              cd /dlio/data
            postBlock: |
              gzip -d /dlio/output/.trace*.pfw.gz
              cat /dlio/output/.trace*.pfw
              sleep infinity
