jiaxiyan commented Sep 24, 2025

Share the domain among BTL and MTL to reduce the number of domains needed. This prevents hitting the resource limit on systems with high core counts.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
MTL is initialized and finalized before BTL.
When MTL and BTL share the same fabric and domain, BTL must be responsible
for closing them to ensure the correct order of resource finalization.

This commit adds flags to indicate whether the fabric and domain are shared
with BTL, allowing MTL to skip closing these objects during its finalization.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
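
The ownership rule in the commit message above can be sketched as follows. This is a minimal illustration, not the PR's code: `mock_obj_t`, `mock_mtl_t`, `mock_mtl_finalize`, and the two `*_shared_with_btl` field names are hypothetical stand-ins for the real libfabric handles and MTL module state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a libfabric fabric or domain handle. */
typedef struct {
    bool open;
} mock_obj_t;

/* Hypothetical MTL state: handles plus the "shared with BTL" flags. */
typedef struct {
    mock_obj_t *fabric;
    mock_obj_t *domain;
    bool fabric_shared_with_btl;
    bool domain_shared_with_btl;
} mock_mtl_t;

/* Finalize the MTL. Objects flagged as shared are left open so that the
 * BTL, which finalizes later, can close them. Returns the number of
 * objects this call actually closed. */
int mock_mtl_finalize(mock_mtl_t *mtl)
{
    int closed = 0;
    assert(mtl != NULL);

    /* Close the domain before the fabric: the domain is the child object. */
    if (mtl->domain != NULL && !mtl->domain_shared_with_btl) {
        mtl->domain->open = false;
        closed++;
    }
    if (mtl->fabric != NULL && !mtl->fabric_shared_with_btl) {
        mtl->fabric->open = false;
        closed++;
    }

    /* Either way, the MTL drops its own references. */
    mtl->domain = NULL;
    mtl->fabric = NULL;
    return closed;
}
```

With both flags set, the call closes nothing and only clears the MTL's pointers; with both flags clear, the MTL closes the domain and then the fabric itself.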
@jiaxiyan

@bwbarrett @hppritcha Can you review this?

Move cm before ob1 during pml selection to ensure MTL initialization
occurs before BTL initialization. This addresses non-deterministic
component ordering caused by filesystem discovery that could
result in ob1 listed before cm.

This allows BTL to reuse MTL's fabric and domain.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
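
As a rough illustration of the reordering this commit describes, the sketch below moves `cm` ahead of `ob1` in a list of discovered component names. It assumes a plain array of names; the actual PML selection code uses Open MPI's own component lists, and `move_cm_before_ob1` is a hypothetical helper, not a function from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Ensure "cm" precedes "ob1" in names[0..n-1]. The relative order of all
 * other entries is preserved. Does nothing if either name is missing or
 * if "cm" already comes first. */
void move_cm_before_ob1(const char *names[], size_t n)
{
    size_t cm = n, ob1 = n;
    assert(names != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(names[i], "cm") == 0) {
            cm = i;
        } else if (strcmp(names[i], "ob1") == 0) {
            ob1 = i;
        }
    }
    if (cm == n || ob1 == n || cm < ob1) {
        return;
    }

    /* "cm" sits after "ob1": shift [ob1, cm) right by one slot and
     * place "cm" where "ob1" used to be. */
    const char *cm_name = names[cm];
    memmove(&names[ob1 + 1], &names[ob1], (cm - ob1) * sizeof(names[0]));
    names[ob1] = cm_name;
}
```

For a discovery order of `ob1, ucx, cm`, the result is `cm, ob1, ucx`, so the outcome no longer depends on the order in which the filesystem returned the components.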
#include "opal/mca/btl/btl.h"
#include "opal/mca/common/ofi/common_ofi.h"
#include "opal/mca/hwloc/base/base.h"
#include "ompi/mca/mtl/base/base.h"

This is an abstraction break. The BTL can't include MTL headers. So whatever you really need here needs to be logic that ends up in the common/ofi component.
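
One way the sharing logic could live in a common layer, so that neither the BTL nor the MTL needs the other's headers, is a reference-counted handle owned by that layer. The sketch below only illustrates the idea under that assumption: `shared_domain_t`, `common_domain_acquire`, and `common_domain_release` are hypothetical names, not existing common/ofi functions, and a real version would call into libfabric to open and close the domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a domain owned by the common layer. */
typedef struct {
    bool open;
    int times_opened;   /* how many times the domain was really opened */
} shared_domain_t;

static shared_domain_t g_domain = { false, 0 };
static int g_refs = 0;

/* Called by either the MTL or the BTL. The first caller opens the domain;
 * later callers reuse it. */
shared_domain_t *common_domain_acquire(void)
{
    if (g_refs == 0) {
        g_domain.open = true;
        g_domain.times_opened++;
    }
    g_refs++;
    return &g_domain;
}

/* Called by each user at finalize. Only the last release closes the
 * domain, so the result does not depend on which layer finalizes first. */
void common_domain_release(shared_domain_t *d)
{
    assert(d == &g_domain);
    assert(g_refs > 0);
    g_refs--;
    if (g_refs == 0) {
        g_domain.open = false;
    }
}
```

With this shape, the MTL and BTL each call acquire at init and release at finalize, and the common layer alone decides when the domain is closed.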
