
Commit 4ddb4d1

Deployed 96ba949 with MkDocs version: 1.6.1
1 parent 71b20f9 commit 4ddb4d1

File tree

7 files changed: +114 -13 lines changed


debug/index.html

Lines changed: 27 additions & 0 deletions
@@ -128,6 +128,18 @@
 
 <li class="nav-item" data-bs-level="1"><a href="#debug" class="nav-link">Debug</a>
 <ul class="nav flex-column">
+<li class="nav-item" data-bs-level="2"><a href="#config-stack-deploys-but-parallelcluster-stack-doesnt" class="nav-link">Config stack deploys, but ParallelCluster stack doesn't</a>
+<ul class="nav flex-column">
+</ul>
+</li>
+<li class="nav-item" data-bs-level="2"><a href="#parallelcluster-stack-creation-fails" class="nav-link">ParallelCluster stack creation fails</a>
+<ul class="nav flex-column">
+<li class="nav-item" data-bs-level="3"><a href="#headnodewaitcondition-failed-to-create" class="nav-link">HeadNodeWaitCondition failed to create</a>
+<ul class="nav flex-column">
+</ul>
+</li>
+</ul>
+</li>
 <li class="nav-item" data-bs-level="2"><a href="#slurm-head-node" class="nav-link">Slurm Head Node</a>
 <ul class="nav flex-column">
 </ul>

@@ -157,6 +169,21 @@
 
 <h1 id="debug">Debug</h1>
 <p>For ParallelCluster and Slurm issues, refer to the official <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html">AWS ParallelCluster Troubleshooting documentation</a>.</p>
+<h2 id="config-stack-deploys-but-parallelcluster-stack-doesnt">Config stack deploys, but ParallelCluster stack doesn't</h2>
+<p>This happens when the lambda function that creates the cluster encounters an error.
+This is usually some kind of configuration error that is detected by ParallelCluster.</p>
+<ul>
+<li>Open the CloudWatch console and go to the log groups</li>
+<li>Find the log group named /aws/lambda/*-CreateParallelCluster</li>
+<li>Look for the error</li>
+</ul>
+<h2 id="parallelcluster-stack-creation-fails">ParallelCluster stack creation fails</h2>
+<h3 id="headnodewaitcondition-failed-to-create">HeadNodeWaitCondition failed to create</h3>
+<p>If the stack fails with an error like:</p>
+<p><code>The following resource(s) failed to create
+[HeadNodeWaitCondition2025050101134602]</code></p>
+<p>Connect to the head node and look in <code>/var/log/ansible.log</code> for errors.</p>
+<p>If it shows that it failed waiting for slurmctld to accept requests then check <code>/var/log/slurmctld.log</code> for errors.</p>
 <h2 id="slurm-head-node">Slurm Head Node</h2>
 <p>If slurm commands hang, then it's likely a problem with the Slurm controller.</p>
 <p>Connect to the head node from the EC2 console using SSM Manager or ssh and switch to the root user.</p>
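
The troubleshooting steps added to debug/index.html above can also be driven from a shell. A minimal sketch, assuming the AWS CLI is configured; the log group name, stack prefix, and grep patterns are illustrative placeholders, not values taken from the commit:

# Find the CreateParallelCluster Lambda log group (the /aws/lambda/*-CreateParallelCluster
# pattern comes from the docs above; the exact prefix depends on your config stack name).
aws logs describe-log-groups \
    --query "logGroups[?contains(logGroupName, 'CreateParallelCluster')].logGroupName" \
    --output text

# Scan recent events in that log group for errors (log group name is a placeholder).
aws logs filter-log-events \
    --log-group-name /aws/lambda/my-config-CreateParallelCluster \
    --filter-pattern ERROR \
    --max-items 20

# For the HeadNodeWaitCondition case, on the head node:
grep -iE 'error|failed' /var/log/ansible.log | tail -n 20
grep -i error /var/log/slurmctld.log | tail -n 20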

deployment-prerequisites/index.html

Lines changed: 13 additions & 7 deletions
@@ -259,11 +259,11 @@ <h2 id="make-sure-using-at-least-python-version-37">Make sure using at least pyt
 Simply install the newer version and then use it to create and activate a virtual environment.</p>
 <pre><code>$ python3 --version
 Python 3.6.8
-$ yum -y install python3.11
-$ python3.11 -m venv ~/.venv-python3.11
-$ source ~/.venv-python3.11/bin/activate
+$ yum -y install python3.12
+$ python3.12 -m venv ~/.venv-python3.12
+$ source ~/.venv-python3.12/bin/activate
 $ python3 --version
-Python 3.11.5
+Python 3.12.8
 </code></pre>
 <h2 id="make-sure-required-packages-are-installed">Make sure required packages are installed</h2>
 <pre><code>cd aws-eda-slurm-cluster

@@ -283,10 +283,10 @@ <h3 id="install-cloud-development-kit-cdk-optional">Install Cloud Development Ki
 <p>The following link documents how to setup for CDK.
 Follow the instructions for Python.</p>
 <p><a href="https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites">https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_prerequisites</a></p>
-<p>Note that CDK requires a pretty new version of nodejs which you may have to download from, for example, <a href="https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz">https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz</a></p>
+<p>Note that CDK requires a pretty new version of nodejs which you may have to download from, for example, <a href="https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz">https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz</a></p>
 <pre><code>sudo yum -y install wget
-wget https://nodejs.org/dist/v16.13.1/node-v16.13.1-linux-x64.tar.xz
-tar -xf node-v16.13.1-linux-x64.tar.xz ~
+wget https://nodejs.org/dist/v20.19.0/node-v20.19.0-linux-x64.tar.xz
+tar -xf node-v20.19.0-linux-x64.tar.xz ~
 </code></pre>
 <p>Add the nodjs bin directory to your path.</p>
 <p><a href="https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install">https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install</a></p>

@@ -323,6 +323,12 @@ <h2 id="create-slurmdbd-instance">Create Slurmdbd Instance</h2>
 <p><strong>Note</strong>: Before ParallelCluster 3.10.0, the slurmdbd daemon that connects to the data was created on each cluster's head node.
 The recommended Slurm architecture is to have a shared slurmdbd daemon that is used by all of the clusters.
 Starting in version 3.10.0, ParallelCluster supports specifying an external slurmdbd instance when you create a cluster and provide a cloud formation template to create it.</p>
+<p><strong>Note</strong>: The Slurm version used by slurmdbd must be greater than or equal to the version of your clusters.
+If you have already deployed a slurmdbd instance then you will need to create a new slurmdbd
+instance with the latest version of ParallelCluster.
+Also note that Slurm only maintains backwards compatibility for the 2 previous major releases so
+at some point you will need to upgrade your clusters to newer versions before you can use the latest version
+of ParallelCluster.</p>
 <p>Follow the directions in this <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html#external-slurmdb-accounting-step1">ParallelCluster tutorial to configure slurmdbd</a>.
 This requires that you have already created the slurm database.</p>
 <p>Here are some notes on the required parameters and how to fill them out.</p>
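
As a follow-up to the "Add the nodejs bin directory to your path" step in the context above, a minimal sketch; the extraction directory name matches the v20.19.0 tarball in the hunk, and the npm-based CDK install follows the linked AWS getting-started guide rather than anything in this commit:

# Assumes node-v20.19.0-linux-x64.tar.xz was extracted into the home directory as shown above.
export PATH=$HOME/node-v20.19.0-linux-x64/bin:$PATH
echo 'export PATH=$HOME/node-v20.19.0-linux-x64/bin:$PATH' >> ~/.bashrc
node --version    # expect v20.19.0

# Per the linked CDK guide, the CDK CLI is installed with npm.
npm install -g aws-cdk
cdk --version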

exostellar-infrastructure-optimizer/index.html

Lines changed: 36 additions & 2 deletions
@@ -170,6 +170,14 @@
 </li>
 <li class="nav-item" data-bs-level="2"><a href="#debug" class="nav-link">Debug</a>
 <ul class="nav flex-column">
+<li class="nav-item" data-bs-level="3"><a href="#how-to-connect-to-ems" class="nav-link">How to connect to EMS</a>
+<ul class="nav flex-column">
+</ul>
+</li>
+<li class="nav-item" data-bs-level="3"><a href="#how-to-connect-to-controller" class="nav-link">How to connect to Controller</a>
+<ul class="nav flex-column">
+</ul>
+</li>
 <li class="nav-item" data-bs-level="3"><a href="#updateheadnode-resource-failed" class="nav-link">UpdateHeadNode resource failed</a>
 <ul class="nav flex-column">
 </ul>

@@ -808,6 +816,19 @@ <h3 id="run-a-test-job-using-slurm">Run a test job using Slurm</h3>
 <pre><code>srun --pty -p xio-
 </code></pre>
 <h2 id="debug">Debug</h2>
+<h3 id="how-to-connect-to-ems">How to connect to EMS</h3>
+<p>Use ssh to connect to the EMS using your EC2 keypair.</p>
+<ul>
+<li><code>ssh-add private-key.pem</code></li>
+<li><code>ssh -A rocky@${EMS_IP_ADDRESS}</code></li>
+</ul>
+<p>You can <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-rocky.html">install the aws-ssm-agent</a> so that you can connect from the EC2 console using SSM.</p>
+<h3 id="how-to-connect-to-controller">How to connect to Controller</h3>
+<ul>
+<li>First ssh to the EMS.</li>
+<li>Get the IP address of the controller from the EC2 console</li>
+<li>As root, ssh to the controller</li>
+</ul>
 <h3 id="updateheadnode-resource-failed">UpdateHeadNode resource failed</h3>
 <p>If the UpdateHeadNode resource fails then it is usually because as task in the ansible script failed.
 Connect to the head node and look for errors in:</p>

@@ -816,17 +837,30 @@ <h3 id="updateheadnode-resource-failed">UpdateHeadNode resource failed</h3>
 <p>When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status.
 Before you can update it again you will need to complete the rollback.
 Go to Stack Actions, select <code>Continue update rollback</code>, expand <code>Advanced troubleshooting</code>, check the UpdateHeadNode resource, anc click <code>Continue update rollback</code>.</p>
+<p>The problem is usually that there is an XWO controller running that is preventing updates to
+the profile.
+Cancel any XWO jobs and terminate any running workers and controllers and verify that all of the XWO profiles are idle.</p>
 <h3 id="xio-controller-not-starting">XIO Controller not starting</h3>
 <p>On EMS, check that a job is running to create the controller.</p>
 <p><code>squeue</code></p>
 <p>On EMS, check the autoscaling log to see if there are errors starting the instance.</p>
 <p><code>less /var/log/slurm/autoscaling.log</code></p>
-<p>EMS Slurm partions are at:</p>
+<p>EMS Slurm partitions are at:</p>
 <p><code>/xcompute/slurm/bin/partitions.json</code></p>
 <p>They are derived from the partition and pool names.</p>
 <h3 id="worker-instance-not-starting">Worker instance not starting</h3>
 <h3 id="vm-not-starting-on-worker">VM not starting on worker</h3>
-<h3 id="vm-not-starting-slurm-job">VM not starting Slurm job</h3></div>
+<p>Connect to the controller instance and run the following command to get a list of worker instances and VMs.</p>
+<pre><code>xspot ps
+</code></pre>
+<p>Connect to the worker VM using the following command.</p>
+<pre><code>xspot console vm-abcd
+</code></pre>
+<p>This will show the console logs.
+If you configured the root password then you can log in as root to do further debug.</p>
+<h3 id="vm-not-starting-slurm-job">VM not starting Slurm job</h3>
+<p>Connect to the VM as above.</p>
+<p>Check /var/log/slurmd.log for errors.</p></div>
 </div>
 </div>
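
For the "How to connect to Controller" steps added in this file's hunks, a minimal sketch run from the EMS; the controller IP address and VM ID are placeholders, and it assumes root on the EMS holds the key used to reach the controller:

# After connecting to the EMS with agent forwarding: ssh -A rocky@${EMS_IP_ADDRESS}
sudo su -                 # become root on the EMS
ssh root@10.0.1.23        # placeholder; use the controller IP from the EC2 console

# On the controller, list worker instances and VMs, then open a VM console (ID is a placeholder).
xspot ps
xspot console vm-abcd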

exostellar-workload-optimizer/index.html

Lines changed: 36 additions & 2 deletions
@@ -174,6 +174,14 @@
 </li>
 <li class="nav-item" data-bs-level="2"><a href="#debug" class="nav-link">Debug</a>
 <ul class="nav flex-column">
+<li class="nav-item" data-bs-level="3"><a href="#how-to-connect-to-ems" class="nav-link">How to connect to EMS</a>
+<ul class="nav flex-column">
+</ul>
+</li>
+<li class="nav-item" data-bs-level="3"><a href="#how-to-connect-to-controller" class="nav-link">How to connect to Controller</a>
+<ul class="nav flex-column">
+</ul>
+</li>
 <li class="nav-item" data-bs-level="3"><a href="#updateheadnode-resource-failed" class="nav-link">UpdateHeadNode resource failed</a>
 <ul class="nav flex-column">
 </ul>

@@ -343,6 +351,19 @@ <h3 id="run-a-test-job-using-slurm">Run a test job using Slurm</h3>
 <pre><code>srun --pty -p xwo-amd-64g-4c hostname
 </code></pre>
 <h2 id="debug">Debug</h2>
+<h3 id="how-to-connect-to-ems">How to connect to EMS</h3>
+<p>Use ssh to connect to the EMS using your EC2 keypair.</p>
+<ul>
+<li><code>ssh-add private-key.pem</code></li>
+<li><code>ssh -A rocky@${EMS_IP_ADDRESS}</code></li>
+</ul>
+<p>You can <a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-rocky.html">install the aws-ssm-agent</a> so that you can connect from the EC2 console using SSM.</p>
+<h3 id="how-to-connect-to-controller">How to connect to Controller</h3>
+<ul>
+<li>First ssh to the EMS.</li>
+<li>Get the IP address of the controller from the EC2 console</li>
+<li>As root, ssh to the controller</li>
+</ul>
 <h3 id="updateheadnode-resource-failed">UpdateHeadNode resource failed</h3>
 <p>If the UpdateHeadNode resource fails then it is usually because a task in the ansible script failed.
 Connect to the head node and look for errors in:</p>

@@ -351,6 +372,9 @@ <h3 id="updateheadnode-resource-failed">UpdateHeadNode resource failed</h3>
 <p>When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status.
 Before you can update it again you will need to complete the rollback.
 Go to Stack Actions, select <code>Continue update rollback</code>, expand <code>Advanced troubleshooting</code>, check the UpdateHeadNode resource, anc click <code>Continue update rollback</code>.</p>
+<p>The problem is usually that there is an XWO controller running that is preventing updates to
+the profile.
+Cancel any XWO jobs and terminate any running workers and controllers and verify that all of the XWO profiles are idle.</p>
 <h3 id="xwo-controller-not-starting">XWO Controller not starting</h3>
 <p>If a controller doesn't start, then the first thing to check is to make sure that the
 <code>/opt/slurm/exostellar/resume_xspot.sh</code> script ran successfully on the head node.</p>

@@ -361,12 +385,22 @@ <h3 id="xwo-controller-not-starting">XWO Controller not starting</h3>
 <p><code>squeue</code></p>
 <p>On EMS, check the autoscaling log to see if there are errors starting the instance.</p>
 <p><code>less /var/log/slurm/autoscaling.log</code></p>
-<p>EMS Slurm partions are at:</p>
+<p>EMS Slurm partitions are at:</p>
 <p><code>/xcompute/slurm/bin/partitions.json</code></p>
 <p>They are derived from the partition and pool names.</p>
 <h3 id="worker-instance-not-starting">Worker instance not starting</h3>
 <h3 id="vm-not-starting-on-worker">VM not starting on worker</h3>
-<h3 id="vm-not-starting-slurm-job">VM not starting Slurm job</h3></div>
+<p>Connect to the controller instance and run the following command to get a list of worker instances and VMs.</p>
+<pre><code>xspot ps
+</code></pre>
+<p>Connect to the worker VM using the following command.</p>
+<pre><code>xspot console vm-abcd
+</code></pre>
+<p>This will show the console logs.
+If you configured the root password then you can log in as root to do further debug.</p>
+<h3 id="vm-not-starting-slurm-job">VM not starting Slurm job</h3>
+<p>Connect to the VM as above.</p>
+<p>Check /var/log/slurmd.log for errors.</p></div>
 </div>
 </div>
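
The "Continue update rollback" recovery described in this file's hunks uses the CloudFormation console; a minimal CLI sketch of the same recovery, assuming the AWS CLI is configured (the stack name and logical resource ID are placeholders):

# Confirm the stack is stuck in UPDATE_ROLLBACK_FAILED (stack name is a placeholder).
aws cloudformation describe-stacks \
    --stack-name my-xwo-cluster \
    --query "Stacks[0].StackStatus"

# Continue the rollback, skipping the failed resource
# (use the logical ID shown in the stack's Resources tab).
aws cloudformation continue-update-rollback \
    --stack-name my-xwo-cluster \
    --resources-to-skip UpdateHeadNode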

index.html

Lines changed: 1 addition & 1 deletion
@@ -329,5 +329,5 @@ <h4 class="modal-title" id="keyboardModalLabel">Keyboard Shortcuts</h4>
 
 <!--
 MkDocs version : 1.6.1
-Build Date UTC : 2025-04-30 21:30:56.431142+00:00
+Build Date UTC : 2025-05-12 17:13:22.467008+00:00
 -->

search/search_index.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

sitemap.xml.gz

0 Bytes
Binary file not shown.
