
Data size based aggregation #4548


Draft
wants to merge 48 commits into base: master

Conversation

scottwittenburg
Collaborator

Goal: Add a new aggregation approach on the writer side, where ranks are grouped into MPIChains of roughly equal data size, for writing.

  • Add a new helper module to handle partitioning of ranks based on data size. It should be easy to add new partitioning strategies, and eventually we should allow selecting one at runtime via config options (one possible strategy is sketched after this list).
  • Add an MPIChain constructor that will put ranks into substreams based on a computed data-size based partitioning of the ranks
  • Update BP5Writer to collect some member fields into a new structure containing the things that should be cached. The purpose is to avoid re-opening the same files on subsequent time steps, when that can be avoided.
  • Update BP5Writer initialization logic to support data-size based aggregation (which is off by default and enabled with a configuration setting).
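
As a concrete example of the kind of strategy the helper module could provide (a hypothetical sketch, not the code in this PR), a greedy assignment like the following happens to reproduce the partitions printed in the test logs later in this thread, e.g. sizes [48, 80, 128, 160, 208] into 3 substreams give [4], [3 0], [2 1]:

    // Hypothetical sketch: sort ranks by descending data size and greedily
    // assign each rank to the currently lightest substream.
    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    std::vector<std::vector<int>> GreedyPartition(const std::vector<uint64_t> &rankSizes,
                                                  size_t numSubstreams)
    {
        // Visit ranks from largest to smallest data size.
        std::vector<int> order(rankSizes.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return rankSizes[a] > rankSizes[b]; });

        std::vector<std::vector<int>> partitions(numSubstreams);
        std::vector<uint64_t> partitionSizes(numSubstreams, 0);
        for (int rank : order)
        {
            // Place the rank in the substream with the smallest accumulated size so far.
            size_t lightest =
                std::min_element(partitionSizes.begin(), partitionSizes.end()) -
                partitionSizes.begin();
            partitions[lightest].push_back(rank);
            partitionSizes[lightest] += rankSizes[rank];
        }
        return partitions;
    }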

@scottwittenburg
Collaborator Author

fyi @pnorbert and @eisenhauer

This is my strawman for data-size based aggregation. Over the next couple of days I plan to comment in various locations in the changeset to ask questions about the direction this approach is taking, with the goal of getting your feedback.

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 758fb60 to 1956453 on June 13, 2025 00:55
useProfiler, *DataWritingComm);
}

// If we got a cache hit above, we'd like to skip the enclosed OpenFiles
Collaborator Author

This is one of the places where I tried to avoid opening files in the case of a cache hit, but it results in deadlock.

@pnorbert @eisenhauer

@@ -121,6 +125,12 @@ bp_gtest_add_tests_helper(WriteReadAttributesMultirank MPI_ALLOW)
bp_gtest_add_tests_helper(LargeMetadata MPI_ALLOW)
bp5_gtest_add_tests_helper(WriteStatsOnly MPI_ALLOW)

if (ADIOS2_HAVE_MPI)
gtest_add_tests_helper(DataSizeAggregate MPI_ONLY BP Engine.BP. .BP5.DSB
WORKING_DIRECTORY ${BP5_DIR} EXTRA_ARGS "DataSizeBased" 2 1
Collaborator Author

@scottwittenburg scottwittenburg Jun 18, 2025

The extra args passed here are the aggregation type, the desired number of subfiles, and the number of time steps to iterate. Run with any number of subfiles and a single time step, the test passes. Running with 2 timesteps results in failure when trying to read the data back in.

2 steps failure details
$ /opt/mpi/mpich/4.1.2/install/bin/mpiexec "-n" "4" "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/build/bin/Test.Engine.BP.DataSizeAggregate.MPI" "--gtest_filter=DSATest.TestWriteUnbalancedData" "DataSizeBased" "2" "2"
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Number of timesteps: 2
Aggregation type: DataSizeBased
Number of subfiles: 2
Rank data sizes (in number of columns)
1 2 3 4 
 --- Constructing BP5Writer (0)--- 
 --- Constructing BP5Writer (2)--- 
 --- Constructing BP5Writer (3)--- 
 BP5Writer::0::BeginStep(StepMode::Append, -1) 
 BP5Writer::2::BeginStep(StepMode::Append, -1) 
 BP5Writer::3::BeginStep(StepMode::Append, -1) 
 --- Constructing BP5Writer (1)--- 
 BP5Writer::0::EndStep() 
 BP5Writer::2::EndStep() 
 BP5Writer::0::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::2::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::1::BeginStep(StepMode::Append, -1) 
 BP5Writer::3::EndStep() 
 BP5Writer::1::EndStep() 
 BP5Writer::3::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::1::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
Rank data sizes: [32, 64, 96, 128Paritioning resulted in 2 substreams:
  0: [3 0 ], partition data size: 160
  1: [2 1 ], partition data size: 160
Rank 0 - A
Rank 0 - B
Rank Rank 1 - A
Rank 1 - B
Rank 1 - D
Rank 1 - F
Rank 2 - A
Rank 2 - B
Rank 2 - D
Rank 2 - F
Rank 3 - A
Rank 3 - B
Rank 3 - D
Rank 3 - F
0 - D
Rank 0 - E
Rank 0 - F
Rank 0 - G
Rank 0 - I
Rank 1 - G
Rank 1 - I
Rank 1 - J
Rank Rank 2 - G
Rank 2 - I
Rank 2 - J
Rank 2 - K
Rank 2 - L
Rank 3 - G
Rank 3 - I
Rank 3 - J
Rank 3 - K
Rank 3 - L
Rank 0 - J
Rank 0 - K
Rank 0 - L
1 - K
Rank 1 - L
Rank 0 - M
Rank 0 - N
Rank 0 - O
Rank 1 - M
Rank 1 - N
Rank 1 InitBPBuffer()
Rank 1 new writer needed, generate now, write later
Rank 2 - M
Rank 2 - N
Rank 2 InitBPBuffer()
Rank 3 - M
Rank 3 - N
Rank 3 InitBPBuffer()
Rank 3 new writer needed, generate now, write later
Rank 2 new writer needed, generate now, write later
Rank 0 InitBPBuffer()
Rank 0 new writer needed, generate now, write later
 BP5Writer::1::BeginStep(StepMode::Append, -1) 
 BP5Writer::2::BeginStep(StepMode::Append, -1) 
 BP5Writer::3::BeginStep(StepMode::Append, -1) 
 BP5Writer::0::WriteMetaMetadata() 
 BP5Writer::0::WriteMetadata() OVERLOAD2
 BP5Writer::1::EndStep() 
 BP5Writer::2::EndStep() 
 BP5Writer::0::WriteMetadataFileIndex() 
 BP5Writer::1::WriteData() 
 BP5Writer::2::WriteData() 
 BP5Writer::3::EndStep() 
 BP5Writer::3::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::0::BeginStep(StepMode::Append, -1) 
 BP5Writer::0::EndStep() 
 BP5Writer::0::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
Rank data sizes: [128, 64, 96, 32Paritioning resulted in 2 substreams:
  0: [0 3 ], partition data size: 160
  1: [2 1 ], partition data size: 160
Rank 1 cache hit for aggregator key 1-2-1-0
Rank 2 cache hit for aggregator key 1-2-0-1
Rank 2 - A
Rank 2 - B
Rank 2 - D
Rank 2 - F
Rank 0 - A
Rank 0 - B
Rank 0 - D
Rank 0 - E
Rank 1 - A
Rank 1 - B
Rank 1 - D
Rank 1 - F
Rank 3 - A
Rank 3 - B
Rank 3 - D
Rank 3 - F
Rank 0 - F
Rank 0 - G
Rank 0 - I
Rank 2 - G
Rank 2 - I
Rank 2 - J
Rank 2 - K
Rank 2 - L
Rank 3 - G
Rank 3 - I
Rank 3 - J
Rank 3 - K
Rank 3 - L
Rank 0 - J
Rank 0 - K
Rank 0 - L
Rank 1 - G
Rank 1 - I
Rank 1 - J
Rank 1 - K
Rank 1 - L
Rank 0 - M
Rank 0 - N
Rank 0 - O
Rank 3 - M
Rank 3 - N
Rank 3 InitBPBuffer()
Rank 3 new writer needed, generate now, write later
Rank 1 - M
Rank 1 - N
Rank 1 InitBPBuffer()
Rank 1 new writer needed, generate now, write later
Rank 2 - M
Rank 2 - N
Rank 2 InitBPBuffer()
Rank 2 new writer needed, generate now, write later
Rank 0 InitBPBuffer()
Rank 0 new writer needed, generate now, write later
 BP5Writer::0::WriteMetaMetadata() 
 BP5Writer::1::DoClose() 
 BP5Writer::2::DoClose() 
 BP5Writer::0::WriteMetadata() OVERLOAD2
 BP5Writer::3::DoClose() 
 BP5Writer::0::WriteMetadataFileIndex() 
 BP5Writer::0::DoClose() 
Finished writing unbalanced data, reading it back
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST
1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST

You might notice in that case that I ran with 4 processes, and that the greedy partitioning strategy, together with my naive choice of cache key, caused two ranks to get a cache hit on the 2nd time step while the other two got cache misses. All ranks write to the same file for t=2 as they did for t=1, but because the partitioning strategy re-ordered the ranks within one of the substreams, and because the cache key includes the substream rank order, those two ranks think they need to open new files.
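
Purely to illustrate that last point (the real key format in this branch may differ), here is the difference between a key that encodes the chain order and one that does not:

    #include <algorithm>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical keys for an aggregator whose substream writes subfile `subfileIdx`
    // and contains `chainRanks` in chain order.
    std::string OrderSensitiveKey(int subfileIdx, const std::vector<int> &chainRanks)
    {
        std::ostringstream key;
        key << subfileIdx;
        for (int r : chainRanks)
            key << "-" << r; // [3, 0] and [0, 3] give different keys for the same file
        return key.str();
    }

    std::string OrderInsensitiveKey(int subfileIdx, std::vector<int> chainRanks)
    {
        std::sort(chainRanks.begin(), chainRanks.end()); // chain order no longer matters
        std::ostringstream key;
        key << subfileIdx;
        for (int r : chainRanks)
            key << "-" << r;
        return key.str();
    }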

Also note that the only reason there is no deadlock when running for 2 timesteps is that, currently, the InitTransports() method does all the normal init steps regardless of whether there was a cache hit, specifically to avoid deadlocks.
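
The deadlock shape, roughly (an illustrative plain-MPI sketch, assuming the guarded init step involves communication among the chain members):

    #include <mpi.h>

    void OpenSubfileCollectively(MPI_Comm chainComm); // stand-in for the guarded init step

    void MaybeOpen(MPI_Comm chainComm, bool cacheHit)
    {
        // BROKEN: ranks with a cache hit return early; any rank that still enters the
        // communicating open waits forever for partners that already left.
        //   if (cacheHit) return;
        //   OpenSubfileCollectively(chainComm);

        // SAFE: agree across the chain first, so either everyone participates or no one does.
        int miss = cacheHit ? 0 : 1;
        int anyMiss = 0;
        MPI_Allreduce(&miss, &anyMiss, 1, MPI_INT, MPI_LOR, chainComm);
        if (anyMiss)
        {
            OpenSubfileCollectively(chainComm);
        }
    }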

Collaborator Author

Looking at the bp file produced when running with 2 timesteps, bpls reports the same error as the test code:

bpls output
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -s "0,0" -c "-1,-1" ./unbalanced_output.bp
Failed to open with BPFile engine: [Tue Jun 17 17:31:35 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version


Error: Could not open this file with any ADIOS2 file reading engines

Collaborator Author

@scottwittenburg scottwittenburg Jun 18, 2025

And bp5dbg has an even harder time with the 2-timestep output:

bp5dbg output
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp
========================================================
    Index Table File: ./unbalanced_output.bp/md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
|                                  |    |    |    |      | yes  |     0    |   0    |  no  |   yes     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [242514657280]
Unknown record , lenght = [242514657280]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [67108864]
Unknown record , lenght = [67108864]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [512]
Unknown record , lenght = [512]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [72057594037927936]
Unknown record , lenght = [72057594037927936]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [15889042532990976]
Unknown record , lenght = [15889042532990976]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [1099511627776000]
Unknown record , lenght = [1099511627776000]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [4294967296000]
Unknown record , lenght = [4294967296000]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [536870912]
Unknown record , lenght = [536870912]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [3145728]
Unknown record , lenght = [3145728]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [8192]
Unknown record , lenght = [8192]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg", line 155, in <module>
    status = DumpIndexTableFile(args)
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg", line 93, in DumpIndexTableFile
    status, MetadataIndexTable, WriterMap = DumpIndexTable(
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages/adios2/bp5dbg/idxtable.py", line 156, in DumpIndexTable
    status, MetadataIndexTable, WriterMap = ReadIndex(
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages/adios2/bp5dbg/idxtable.py", line 64, in ReadIndex
    reclen = np.frombuffer(table, dtype=np.uint64, count=1,
ValueError: buffer is smaller than requested size

Collaborator Author

And finally, note that I think the test logic itself is more or less sound for 2 timesteps, because if you provide EveryoneWritesSerial as the first argument, the test passes, and bpls / bp5dbg produce reasonable-looking (to me) output.

Collaborator Author

It appears to me, from the output of the test / bpls / bp5dbg, that the error is coming from here, and thus is possibly caused by some problem in the metadata index file?

Member

And to make sure I understand, you would only expect this problem to rear its head when we exceed OneLevelGatherRanksLimit?

Yes, I believe that that's true...

Collaborator Author

@scottwittenburg scottwittenburg Jun 18, 2025

I'm mulling those first two options. Upon first consideration, option 1 seems more attractive since it seems easier to implement (but maybe I just haven't looked far enough along that path yet).

Yes, I believe that that's true...

And when it happens, how do you expect it would manifest itself? I'm just looking for a rough idea here, like would it segfault when writing, or just produce data that can't be read back in correctly?

Member

Maybe a segfault when writing, but definitely bad data on read (unless you got lucky). That "per rank metadata size" array is gathered in a one-level gather, so the size values will be in a canonical order. It's then used to create the per-rank size values in the subsequent gathervs, be they one or two level. If we end up associating a size with the wrong rank we'll gather trash data, either too little from one rank or too much from another. The latter might be a segfault if it was unmapped memory, but even if it didn't blow up on the writer, we'd be calculating the wrong start positions for metadata on the reader and blow up when trying to parse. (Actually, you'd likely get an FFS error about unknown formats because it wouldn't find the right header when starting the parse...)
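
A plain-MPI sketch of the two-phase pattern being described (illustrative only; ADIOS2 uses its own helper::Comm wrappers): the per-rank sizes arrive from a one-level gather in canonical rank order and then become the counts and displacements of the subsequent gatherv, so associating a size with the wrong rank corrupts every offset after it.

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    std::vector<char> GatherMetadata(MPI_Comm comm, const std::vector<char> &localMD)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        // Phase 1: one-level gather of per-rank metadata sizes, canonical rank order.
        uint64_t mySize = localMD.size();
        std::vector<uint64_t> sizes(size);
        MPI_Gather(&mySize, 1, MPI_UINT64_T, sizes.data(), 1, MPI_UINT64_T, 0, comm);

        // Phase 2: counts and displacements derived from those sizes, in the same order.
        // If sizes[i] does not really belong to rank i, every displacement after i is wrong
        // and the gathered metadata is garbage.
        std::vector<int> counts(size), displs(size);
        std::vector<char> all;
        if (rank == 0)
        {
            int offset = 0;
            for (int i = 0; i < size; ++i)
            {
                counts[i] = static_cast<int>(sizes[i]);
                displs[i] = offset;
                offset += counts[i];
            }
            all.resize(offset);
        }
        MPI_Gatherv(localMD.data(), static_cast<int>(localMD.size()), MPI_CHAR, all.data(),
                    counts.data(), displs.data(), MPI_CHAR, 0, comm);
        return all; // meaningful only on rank 0
    }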

Collaborator Author

I'm going back to consider the two-level metadata aggregation, and want to get more clarification on the problem in general, and one of those approaches you mentioned specifically.

Generally, I'm not sure I see in the code how the order of the initial data aggregator/mpichains has anything to do with the metadata aggregator/mpichains. They are currently separate aggregators, and as far as I can see, their only connection to each other is that they both have MPI_COMM_WORLD (or whatever communicator is passed to the writer's constructor) as their parent communicator.

And regarding option 1 specifically, you said:

First, we could separate the aggregation communicators used for data from that used for metadata. There is no logical relationship between those operations, so by sticking with the original plan for metadata communication we avoid all these issues. (And I don't think there's overhead associated with having multiple subcommunicators.)

When you say "original plan" here, do you mean how metadata was aggregated before this PR? Or is it rather something we discussed in a meeting? So far in this PR I think I have tried not to touch how metadata is aggregated, so is it possible this is already a non-issue?
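
For reference, a hypothetical sketch (plain MPI, not the ADIOS2 helper::Comm API) of what keeping the data and metadata aggregation communicators disjoint could look like, so that re-partitioning the data chains each step never perturbs the metadata chains:

    #include <mpi.h>

    // Assumes metadataAggregators >= 1; names are illustrative, not the PR's code.
    void SplitCommunicators(MPI_Comm world, int dataSubstreamForMyRank, int metadataAggregators,
                            MPI_Comm *dataChainComm, MPI_Comm *metadataChainComm)
    {
        int rank;
        MPI_Comm_rank(world, &rank);

        // Data chains: colored by the substream chosen by the size-based partition;
        // this split may be redone every step.
        MPI_Comm_split(world, dataSubstreamForMyRank, rank, dataChainComm);

        // Metadata chains: colored by a fixed, rank-derived assignment, independent of
        // the data partition, so it never changes between steps.
        MPI_Comm_split(world, rank % metadataAggregators, rank, metadataChainComm);
    }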

Member

If those aggregators are disjoint, then it's a non-issue. I was under the impression that we were using the same aggregators for both data and metadata aggregation when metadata aggregation was multi-level. If we're not, then no problems.

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from c9de252 to 4408611 on June 18, 2025 01:56
@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 18, 2025

With my current implementation, I have found a broken case when 3 of 5 ranks get an aggregator cache hit, while 2 of 5 do not.

I'll try to describe the problem by first showing the expected outputs when the aggregation type is EveryoneWritesSerial, and then comparing those to the outputs when the aggregation type is DataSizeBased.

In both cases I run with 5 mpi processes, 3 subfiles, and 2 timesteps.

First look at the data produced when using EveryoneWritesSerial, and note that the test passes in the process of producing the data.

EveryoneWritesSerial - bpls
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -D ./unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
    (0,0)    0 15 30 45 60 
          block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62 
          block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65 
          block 3: [0:4,  6: 9]
    (0,0)    6 7 8 9 21 22
    (1,2)    23 24 36 37 38 39
    (3,0)    51 52 53 54 66 67
    (4,2)    68 69 
          block 4: [0:4, 10:14]
    (0,0)    10 11 12 13 14 25
    (1,1)    26 27 28 29 40 41
    (2,2)    42 43 44 55 56 57
    (3,3)    58 59 70 71 72 73
    (4,4)    74 
        step 1: 
          block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137
    (4,4)    138 
          block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140 
          block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143 
          block 3: [0:4, 10:13]
    (0,0)    84 85 86 87 99 100
    (1,2)    101 102 114 115 116 117
    (3,0)    129 130 131 132 144 145
    (4,2)    146 147 
          block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148
EveryoneWritesSerial - bp5dbg
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp/
========================================================
    Index Table File: ./unbalanced_output.bp//md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| ADIOS-BP v2.:.0 Index Table      |  2  |  :  |  0  |      | yes  |     5    |   2    |  no  |    no     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 3  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        0 |
  |      1 |        0 |
  |      2 |        1 |
  |      3 |        1 |
  |      4 |        2 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 0     | MetadataPos = 0          |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:0
 Writer 1 data loc:4096
 Writer 2 data loc:0
 Writer 3 data loc:4096
 Writer 4 data loc:0
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 1     | MetadataPos = 1248       |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:8192
 Writer 1 data loc:12288
 Writer 2 data loc:8192
 Writer 3 data loc:12288
 Writer 4 data loc:4096
--------------------------------------------------------------------------------------------------
========================================================
    MetaMetadata File: ./unbalanced_output.bp/mmd.0
========================================================
 File size = 628
 --------------------------------------------------
 |  Record |  Offset  |  ID length |  Info length |
 --------------------------------------------------
 |       0 |        0 |         12 |          600 |
 --------------------------------------------------
========================================================
    Metadata File: ./unbalanced_output.bp/md.0
========================================================
Step 0: 
  Offset = 0
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 80
    Writer 1: md size 232 offset 312
    Writer 2: md size 232 offset 544
    Writer 3: md size 232 offset 776
    Writer 4: md size 232 offset 1008
  Attribute metadata entries: 
    Writer 0: md size 0 offset 1240
    Writer 1: md size 0 offset 1240
    Writer 2: md size 0 offset 1240
    Writer 3: md size 0 offset 1240
    Writer 4: md size 0 offset 1240
========================================================
Step 1: 
  Offset = 1248
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 1328
    Writer 1: md size 232 offset 1560
    Writer 2: md size 232 offset 1792
    Writer 3: md size 232 offset 2024
    Writer 4: md size 232 offset 2256
  Attribute metadata entries: 
    Writer 0: md size 0 offset 2488
    Writer 1: md size 0 offset 2488
    Writer 2: md size 0 offset 2488
    Writer 3: md size 0 offset 2488
    Writer 4: md size 0 offset 2488

Now look at the data produced when using DataSizeBased (which is associated with test failure).

DataSizeBased - bpls
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -D ./unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
    (0,0)    0 0 0 0 0 
          block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62 
          block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65 
          block 3: [0:4,  6: 9]
    (0,0)    0 0 0 0 0 0
    (1,2)    0 0 0 0 0 0
    (3,0)    0 0 0 0 0 0
    (4,2)    0 0 
          block 4: [0:4, 10:14]
    (0,0)    0 0 0 0 0 0
    (1,1)    0 0 0 0 0 0
    (2,2)    0 0 0 0 0 0
    (3,3)    0 0 0 0 0 0
    (4,4)    0 
        step 1: 
          block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137
    (4,4)    138 
          block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140 
          block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143 
          block 3: [0:4, 10:13]
    (0,0)    0 0 0 0 0 0
    (1,2)    0 0 0 0 0 0
    (3,0)    0 0 0 0 0 0
    (4,2)    0 0 
          block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148
DataSizeBased - bp5dbg
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp/
========================================================
    Index Table File: ./unbalanced_output.bp//md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| ADIOS-BP v2.:.0 Index Table      |  2  |  :  |  0  |      | yes  |     5    |   2    |  no  |    no     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 0  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        1 |
  |      1 |        2 |
  |      2 |        2 |
  |      3 |        1 |
  |      4 |        0 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 0     | MetadataPos = 0          |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:4096
 Writer 1 data loc:4096
 Writer 2 data loc:0
 Writer 3 data loc:0
 Writer 4 data loc:0
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 0  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        0 |
  |      1 |        2 |
  |      2 |        2 |
  |      3 |        1 |
  |      4 |        1 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 1     | MetadataPos = 1248       |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:8192
 Writer 1 data loc:12288
 Writer 2 data loc:8192
 Writer 3 data loc:8192
 Writer 4 data loc:12288
--------------------------------------------------------------------------------------------------
========================================================
    MetaMetadata File: ./unbalanced_output.bp/mmd.0
========================================================
 File size = 628
 --------------------------------------------------
 |  Record |  Offset  |  ID length |  Info length |
 --------------------------------------------------
 |       0 |        0 |         12 |          600 |
 --------------------------------------------------
========================================================
    Metadata File: ./unbalanced_output.bp/md.0
========================================================
Step 0: 
  Offset = 0
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 80
    Writer 1: md size 232 offset 312
    Writer 2: md size 232 offset 544
    Writer 3: md size 232 offset 776
    Writer 4: md size 232 offset 1008
  Attribute metadata entries: 
    Writer 0: md size 0 offset 1240
    Writer 1: md size 0 offset 1240
    Writer 2: md size 0 offset 1240
    Writer 3: md size 0 offset 1240
    Writer 4: md size 0 offset 1240
========================================================
Step 1: 
  Offset = 1248
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 1328
    Writer 1: md size 232 offset 1560
    Writer 2: md size 232 offset 1792
    Writer 3: md size 232 offset 2024
    Writer 4: md size 232 offset 2256
  Attribute metadata entries: 
    Writer 0: md size 0 offset 2488
    Writer 1: md size 0 offset 2488
    Writer 2: md size 0 offset 2488
    Writer 3: md size 0 offset 2488
    Writer 4: md size 0 offset 2488

Note when reading the output above that the size-based partitioning produced at t=1 was [[4],[3,0],[2,1]], while at t=2 the partitioning was [[0],[3,4],[2,1]]. Considering that information, there's at least one noticeable problem: the Writer 0 data loc: 8192 for the second timestep seems like it should actually be Writer 0 data loc: 4096, since only one rank wrote to file 0 on the previous time step.

Any idea where I should be looking to correct that problem? Seems like it's in the metadata index file, but beyond that, I'm not sure. Also, it appears to me that the other writer data locations for timestep 2 are correct, which is confusing, unless I'm interpreting the data wrong.

One last note: when repeating the above tests with only one time step, the test passes in both cases, and the data looks fine. To me that seems to indicate the problem is related to repeating some of the initialization in InitTransports() and InitBPBuffer() more than once.

@eisenhauer
Member

Hmm. A quick question: are you using SelectiveAggregationMetadata? I.e., is the UseSelectiveMetadataAggregation engine parameter at its default "true" value, and is the first branch of this if in BP5Writer being taken?

    if (m_Parameters.UseSelectiveMetadataAggregation)
    {
        SelectiveAggregationMetadata(TSInfo);
    }
    else
    {
        TwoLevelAggregationMetadata(TSInfo);
    }

Because if not, if you're falling into TwoLevelAggregationMetadata, then all the unfortunate effects that I might have anticipated for >6K nodes are hitting you now...

@scottwittenburg
Collaborator Author

A quick question. Are you using SelectiveAggregationMetadata?

Ah yes, great question, I assume you asked because you noticed the re-ordering between timesteps. I just confirmed that in this case, it's taking the SelectiveAggregationMetadata path.

@scottwittenburg
Collaborator Author

The aggregator rank in each MPI chain receives the final data position from the last rank in the chain. Is that because there is an assumption that the aggregator rank is going to be writing to the same file the next time around?
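
For context, here is roughly how that hand-off could work in a write-serial chain (an illustrative plain-MPI sketch, not the actual aggregator code): each rank gets its start offset from its predecessor, and the last rank reports the final end-of-data position back to the aggregator, which only pays off on the next step if the aggregator keeps writing the same file.

    #include <mpi.h>
    #include <cstdint>

    // Ranks in chainComm share one subfile; rank 0 of the chain is the aggregator.
    uint64_t AdvanceChainOffset(MPI_Comm chainComm, uint64_t startOfStep, uint64_t myDataSize)
    {
        int rank, size;
        MPI_Comm_rank(chainComm, &rank);
        MPI_Comm_size(chainComm, &size);

        uint64_t myStart = startOfStep;
        if (rank > 0)
        {
            // Wait for the predecessor to say where its data ended.
            MPI_Recv(&myStart, 1, MPI_UINT64_T, rank - 1, 0, chainComm, MPI_STATUS_IGNORE);
        }

        uint64_t myEnd = myStart + myDataSize; // this rank writes [myStart, myEnd)
        if (rank < size - 1)
        {
            // Pass the running offset to the next rank in the chain.
            MPI_Send(&myEnd, 1, MPI_UINT64_T, rank + 1, 0, chainComm);
        }
        else if (size > 1)
        {
            // Last rank reports the final position back to the aggregator.
            MPI_Send(&myEnd, 1, MPI_UINT64_T, 0, 0, chainComm);
        }

        uint64_t finalPos = myEnd;
        if (rank == 0 && size > 1)
        {
            MPI_Recv(&finalPos, 1, MPI_UINT64_T, size - 1, 0, chainComm, MPI_STATUS_IGNORE);
        }
        return finalPos; // meaningful on the aggregator (rank 0)
    }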

@eisenhauer
Member

Just wanted to make sure, because the TwoLevel stuff uses those aggregators, and if they're changing, potentially bad things. Also, as a debugging aid, make sure this #ifdef DUMPDATALOCINFO evaluates to true so that you get debugging output from writer rank 0 without having to rely on bp5dbg. I'm assuming that you're not doing any data flushes, so what happens in FlushData is irrelevant. So you'll be tracking the m_StartDataPos that comes into SelectiveAggregationMetadata. Whatever happens in WriteData, that should accurately reflect where that rank's data starts in whatever file it lands in. If it doesn't, nothing else will go well after that.

@eisenhauer
Member

Something else to note while I'm thinking about it. You'll notice that WriteData() is also being called in FlushData(). FlushData is a user-visible BP5 call that allows the application to write partial timestep data in the middle of a timestep (i.e. before EndStep()). This is useful if you are pushing memory limits and don't have room to buffer. So, for example, you could do BeginStep(), Put(), Put(), FlushData(), Put(), Put(), FlushData(), Put(), EndStep(). Each of those flushes and the end step results in data (but not metadata) going to disk.

The FlushPosSizeInfo vector on rank 0 accumulates all the info from each rank for where its data landed and how much data there was on each flush. Note that, as a natural consequence of writing piece-wise, the data from rank X is not contiguous in the data file, but broken up into multiple segments. In order to find the data at Offset M, timestep T, from rank N, we have to have the offset and size of each chunk, so this all ends up in a single metadata index record for that timestep.

The relevant bit for you is that WriterMapRecords only occur between timesteps, which means that we don't have the ability to let a rank change what file it is writing to in the middle of a timestep. So the first time WriteData() is called in a timestep (which might be in a FlushData or in EndStep), you are free to change the WriterSubfileMap, but you should not change it again in that timestep. You're probably not playing with those examples yet, but this is something that will require a little bit of extra logic to make sure it doesn't happen.
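
A minimal sketch of the extra guard that last point suggests (the type and member names here are hypothetical, not existing BP5Writer fields): recompute the partition only on the first WriteData() of a step and reuse it for any further flushes in that step.

    // Hypothetical guard, for illustration only.
    struct DataSizeAggregationState
    {
        bool m_WriterMapChosenThisStep = false;

        void BeginStep() { m_WriterMapChosenThisStep = false; }

        void OnWriteData(/* BP5Writer &writer */)
        {
            if (!m_WriterMapChosenThisStep)
            {
                // First WriteData of the step (from a mid-step flush or EndStep):
                // recompute the size-based partition and, if it changed, emit a
                // WriterMap record ahead of the Step record.
                // RepartitionAndRebuildChains(writer);
                m_WriterMapChosenThisStep = true;
            }
            // Later WriteData calls in the same step keep the existing chains, so all
            // flushes of the step land in the same subfile.
        }
    };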

// Stuff we only want to do on the first timestep
m_SubstreamDataPos.resize(m_Parameters.NumSubFiles);
}
m_CommAggregators =
Collaborator Author

It appeared to me that m_CommAggregators was split off from the world, and then never used anywhere, so I decided to re-purpose it. Hopefully that's acceptable and I didn't misread the code.

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 25, 2025

By the way, on the master branch, when I #define DUMPDATALOCINFO, this is what it outputs for the first time step:

Extra debug output:
315: Flush count is :0
315: Write Index positions = {
315: Writer 0 has data at: 
315: loc:O
315: Writer 1 has data at: 
315: loc:S
315: Writer 2 has data at: 
315: loc:-
315: Writer 3 has data at: 
315: loc:B
315: Writer 4 has data at: 
315: loc:P
315: Writer 5 has data at: 
315: loc: 
315: Writer 6 has data at: 
315: loc:v
315: Writer 7 has data at: 
315: loc:2
315: Writer 8 has data at: 
315: loc:.
315: Writer 9 has data at: 
315: loc::
315: Writer 10 has data at: 
315: loc:.
315: Writer 11 has data at: 
315: loc:0
315: Writer 12 has data at: 
315: loc: 
315: Writer 13 has data at: 
315: loc:I
315: Writer 14 has data at: 
315: loc:n
315: Writer 15 has data at: 
315: loc:d
315: Writer 16 has data at: 
315: loc:e
315: Writer 17 has data at: 
315: loc:x
315: Writer 18 has data at: 
315: loc: 
315: Writer 19 has data at: 
315: loc:T
315: Writer 20 has data at: 
315: loc:a
315: Writer 21 has data at: 
315: loc:b
315: Writer 22 has data at: 
315: loc:l
315: Writer 23 has data at: 
315: loc:e

For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

@eisenhauer
Copy link
Member

For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

Ah, yes, that's probably the case. Those items are size_t and maybe the type of buf changed. Which may mean those index calcs into it are off too. I can't look tonight, but will check tomorrow.

@eisenhauer
Copy link
Member

OK, try substituting this code into the matching bits of WriteMetadataFileIndex():

    // Step record
    record = StepRecord;
#ifndef DUMPDATALOCINFO
    size_t StepRecordStartPos = pos;
#endif
    helper::CopyToBuffer(buf, pos, &record, 1); // record type
    d = (3 + ((FlushPosSizeInfo.size() * 2) + 1) * m_Comm.Size()) * sizeof(uint64_t);
    helper::CopyToBuffer(buf, pos, &d, 1); // record length
    helper::CopyToBuffer(buf, pos, &MetaDataPos, 1);
    helper::CopyToBuffer(buf, pos, &MetaDataSize, 1);
    d = static_cast<uint64_t>(FlushPosSizeInfo.size());
    helper::CopyToBuffer(buf, pos, &d, 1);

    for (int writer = 0; writer < m_Comm.Size(); writer++)
    {
        for (size_t flushNum = 0; flushNum < FlushPosSizeInfo.size(); flushNum++)
        {
            // add two numbers here
            helper::CopyToBuffer(buf, pos, &FlushPosSizeInfo[flushNum][2 * writer], 2);
        }
        helper::CopyToBuffer(buf, pos, &m_WriterDataPos[writer], 1);
    }

    m_FileMetadataIndexManager.WriteFiles((char *)buf.data(), buf.size());
#ifndef DUMPDATALOCINFO
    std::cout << "WriterMapRecordType is: " << (buf.data() + StepRecordStartPos)[0] << std::endl;
    size_t *BufPtr = (size_t*)(buf.data() + StepRecordStartPos + 1);
    std::cout << "WriterMapRecordLength is: " << *BufPtr++ << std::endl;
    std::cout << "MetadataPos is: " << *BufPtr++ << std::endl;
    std::cout << "MetadataSize is: " << *BufPtr++ << std::endl;
    std::cout << "Flush count is :" << *BufPtr++ << std::endl;
    std::cout << "Write Index positions = {" << std::endl;

    for (size_t i = 0; i < m_Comm.Size(); ++i)
    {
        std::cout << "Writer " << i << " has data at: " << std::endl;
        uint64_t eachWriterSize = FlushPosSizeInfo.size() * 2 + 1;
        for (size_t j = 0; j < FlushPosSizeInfo.size(); ++j)
        {
            std::cout << "loc:" << *BufPtr++;
            std::cout << " siz:" << *BufPtr++ << std::endl;
        }
        std::cout << "loc:" << *BufPtr++ << std::endl;
    }
    std::cout << "}" << std::endl;
#endif
    m_FileMetadataIndexManager.FlushFiles();

(I won't put these changes into a PR now because that would just make merging your code more difficult when it comes to that.)

@scottwittenburg
Collaborator Author

OK, try substituting in this code into the matching bits of WriteMetadataFileIndex()

Thanks Greg, that works and jibes with both the extra debug printing I was doing and with what bp5dbg shows. Having this is nice, though, because I have been printing the source information, and what you have here confirms it is transferred to the buffer correctly.

So I'm just curious whether this summary below 1) contains all the relevant information you would need to judge whether the data was written correctly, and 2) actually looks correct to you based on the writer chains at each time step (it looks correct to me).

Debug output showing writer chains and metadata
Step 0:

    Rank data sizes: [48, 80, 128, 160, 208]
    Paritioning resulted in 3 substreams:
    0: [4 ]
    1: [3 0 ]
    2: [2 1 ]

    WriterMapRecordType is: s
    WriterMapRecordLength is: 64
    MetadataPos is: 0
    MetadataSize is: 1248
    Flush count is :0

    Write Index positions = {
    Writer 0 has data at:
    loc:4096
    Writer 1 has data at:
    loc:4096
    Writer 2 has data at:
    loc:0
    Writer 3 has data at:
    loc:0
    Writer 4 has data at:
    loc:0
    }

Step 1:

    Rank data sizes: [208, 80, 128, 160, 48]
    Paritioning resulted in 3 substreams:
    0: [0 ]
    1: [3 4 ]
    2: [2 1 ]

    WriterMapRecordType is: s
    WriterMapRecordLength is: 64
    MetadataPos is: 1248
    MetadataSize is: 1248
    Flush count is :0

    Write Index positions = {
    Writer 0 has data at:
    loc:4096
    Writer 1 has data at:
    loc:12288
    Writer 2 has data at:
    loc:8192
    Writer 3 has data at:
    loc:8192
    Writer 4 has data at:
    loc:12288
    }

And here is the comparison of what is produced by EveryoneWritesSerial aggregation vs DataSizeBased:

Side-by-side bpls output
       EveryoneWritesSerial          |             DataSizeBased
-----------------------------------  |  -----------------------------------
  uint64_t  GlobalArray  2*{5, 15}   |    uint64_t  GlobalArray  2*{5, 15}
        step 0:                      |          step 0:
          block 0: [0:4,  0: 0]      |            block 0: [0:4,  0: 0]
    (0,0)    0 15 30 45 60           |      (0,0)    0 0 0 0 0
          block 1: [0:4,  1: 2]      |            block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32         |      (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62             |      (3,0)    46 47 61 62
          block 2: [0:4,  3: 5]      |            block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20          |      (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50       |      (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65                |      (4,0)    63 64 65
          block 3: [0:4,  6: 9]      |            block 3: [0:4,  6: 9]
    (0,0)    6 7 8 9 21 22           |      (0,0)    0 0 0 0 0 0
    (1,2)    23 24 36 37 38 39       |      (1,2)    0 0 0 0 0 0
    (3,0)    51 52 53 54 66 67       |      (3,0)    0 0 0 0 0 0
    (4,2)    68 69                   |      (4,2)    0 0
          block 4: [0:4, 10:14]      |            block 4: [0:4, 10:14]
    (0,0)    10 11 12 13 14 25       |      (0,0)    0 0 0 0 0 0
    (1,1)    26 27 28 29 40 41       |      (1,1)    0 0 0 0 0 0
    (2,2)    42 43 44 55 56 57       |      (2,2)    0 0 0 0 0 0
    (3,3)    58 59 70 71 72 73       |      (3,3)    0 0 0 0 0 0
    (4,4)    74                      |      (4,4)    0
        step 1:                      |          step 1:
          block 0: [0:4,  0: 4]      |            block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89       |      (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105     |      (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121 |      (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137 |      (3,3)    122 123 134 135 136 137
    (4,4)    138                     |      (4,4)    138
          block 1: [0:4,  5: 6]      |            block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110     |      (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140         |      (3,0)    124 125 139 140
          block 2: [0:4,  7: 9]      |            block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98       |      (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128 |      (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143             |      (4,0)    141 142 143
          block 3: [0:4, 10:13]      |            block 3: [0:4, 10:13]
    (0,0)    84 85 86 87 99 100      |      (0,0)    0 0 0 0 0 0
    (1,2)    101 102 114 115 116 117 |      (1,2)    0 0 0 0 0 0
    (3,0)    129 130 131 132 144 145 |      (3,0)    0 0 0 0 0 0
    (4,2)    146 147                 |      (4,2)    0 0
          block 4: [0:4, 14:14]      |            block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148      |      (0,0)    88 103 118 133 148

Let me know if you spot something that looks wrong, or if I haven't included complete details.

@eisenhauer
Member

The metadata file index stuff does look like what I'd expect. However, things obviously aren't right even with just a single timestep, so something is clearly going badly. I'd like to try to reproduce this output, and I'm hopeful that if I step through with the debugger on the reader side I might see what the issue is. Can you tell me what/how you're running on the writer side so I can give that a go?

@scottwittenburg
Collaborator Author

However, things obviously aren't right even with just a single timestep,

Well, if I run the test with just a single timestep, everything looks fine and works. But when I run the test with two timesteps, somehow the first timestep gets messed up retroactively or something.

so something is clearly going badly

😅 That's for sure!

Can you tell me what/how you're running on the writer side so I can give that a go?

That would be so great, thank you! I'm just running variations of one of the new tests I added, which you can run with all the defaults as:

ctest -R Engine.BP.DSATest.TestWriteUnbalancedData.BP5.DSB.MPI --verbose

The --verbose argument to ctest shows the underlying command it runs, and you can just run that if you want to change the number of MPI processes or any of the default arguments (which are: aggregation type, number of subfiles, and number of time steps). This is what I'm doing to produce the broken bp file: 5 ranks, DataSizeBased aggregation, 3 subfiles, 2 timesteps. Here's the command:

mpiexec "-n" "5" \
    "./bin/Test.Engine.BP.DataSizeAggregate.MPI" \
    "--gtest_filter=DSATest.TestWriteUnbalancedData" \
    "DataSizeBased" "3" "2"

Many combinations of parameters don't result in failure or a messed-up output file like the example above does. I suspect it's probably related to how some of the ranks change the subfile they're writing to between steps. You can try with 1 timestep instead of 2, and you should see that it works.

Note: I need to rm -rf ./unbalanced_output.bp from the build tree between runs, or it sometimes causes the test to fail. Maybe because some of the files don't get overwritten? Or running with more ranks and subsequently fewer leaves files in the directory that break things? 🤷

@eisenhauer
Member

eisenhauer commented Jun 26, 2025

OK, I'll give this a go. I've got a couple of calls yet this afternoon, so it won't be right away. Just FYI, in case you get there first, my plan is to look at what is happening in BP5Reader::ReadData(). The BP5Deserializer looks at the Get()s that happen in a timestep and translates those into a ReadList of data that it needs. That list looks like: from WriterRank W, at Timestep T, read Length bytes from Offset S. ReadData()'s responsibility is to translate that into actual data access. One of the first things it does is to translate W and T to a SubfileNum. Looking at what's going wrong there should be informative.

@scottwittenburg
Collaborator Author

Ok thanks for that information, I'll start looking in BP5Reader::ReadData() as well, since I think it will be helpful for me to have at least some familiarity with the read side.

@eisenhauer
Member

output.txt
Attaching a file with output from bpls with some extra diagnostic output. It looks like we're going to the right data file and the right place in the data file to read for timestep 0, writer rank 0.

(base) eisen@Endor build % bin/bpls -D -d unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
WritermapIndex for Timestep 0 is 0
Read from Timestep 0 WriterRank 0 Going to Subfilenum 1
Subfilenum 1 has name unbalanced_output.bp/data.1
DataPosition for Timestep 0 WriterRank 0 is 4096
Offset of Data in block is 0 Final offset in subfile 1 is 4096
    (0,0)    0 0 0 0 0 

But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?

@eisenhauer
Member

Whatever is going on, subfile 1 has zeroes up through Offset 12288, and that doesn't look right.

(base) eisen@Endor build % od -A x unbalanced_output.bp/data.1 
0000000    000000  000000  000000  000000  000000  000000  000000  000000
*
0003000    000130  000000  000000  000000  000147  000000  000000  000000
0003010    000166  000000  000000  000000  000205  000000  000000  000000
0003020    000224  000000  000000  000000  000000  000000  000000  000000
0003030

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 26, 2025

subfile 1 has zeroes up through Offset 12288, and that doesn't look right

Huh, that is interesting. There is initialization stuff that now gets done on every timestep (it used to be done only once), and I'm just selectively guarding against doing some specific things in there. Maybe more bits need to be guarded against running 1) on a timestep after the first, or 2) in the case of an aggregator output file cache hit, like seeking to position 0 in a file, for example.

I wrote a tiny standalone program just to read the file (mainly so I could run it with a single process), and then added some debug printing to the reader. I see a couple weird things off the bat:

single process reader debug output
rank 0 size 1
ReadRequest:   Timestep: 0  WriterRank: 4  StartOffset: 0  ReadLength: 200
ReadData SubFileNum: 0
ReadRequest:   Timestep: 0  WriterRank: 0  StartOffset: 0  ReadLength: 40
ReadData SubFileNum: 1
ReadRequest:   Timestep: 0  WriterRank: 3  StartOffset: 0  ReadLength: 160
ReadData SubFileNum: 1
ReadRequest:   Timestep: 0  WriterRank: 1  StartOffset: 0  ReadLength: 80
ReadData SubFileNum: 2
ReadRequest:   Timestep: 0  WriterRank: 2  StartOffset: 0  ReadLength: 120
ReadData SubFileNum: 2
Step data:
0 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 16 17 18 19 20 0 0 0 0 0 0 0 0 0 0 31 32 33 34 35 0 0 0 0 0 0 0 0 0 0 46 47 48 49 50 0 0 0 0 0 0 0 0 0 0 61 62 63 64 65 0 0 0 0 0 0 0 0 0 
ReadRequest:   Timestep: 1  WriterRank: 0  StartOffset: 0  ReadLength: 200
ReadData SubFileNum: 0
ReadRequest:   Timestep: 1  WriterRank: 3  StartOffset: 0  ReadLength: 160
ReadData SubFileNum: 1
ReadRequest:   Timestep: 1  WriterRank: 4  StartOffset: 0  ReadLength: 40
ReadData SubFileNum: 1
ReadRequest:   Timestep: 1  WriterRank: 1  StartOffset: 0  ReadLength: 80
ReadData SubFileNum: 2
ReadRequest:   Timestep: 1  WriterRank: 2  StartOffset: 0  ReadLength: 120
ReadData SubFileNum: 2
Step data:
74 75 76 77 78 79 80 81 82 83 0 0 0 0 88 89 90 91 92 93 94 95 96 97 98 0 0 0 0 103 104 105 106 107 108 109 110 111 112 113 0 0 0 0 118 119 120 121 122 123 124 125 126 127 128 0 0 0 0 133 134 135 136 137 138 139 140 141 142 143 0 0 0 0 148

It's odd to me that the start offset is 0 regardless of timestep or writer rank. It's also odd that three of the read lengths are off by 8 bytes (something I also see in bp5dbg output). Based on the amounts of data I debug printed on each rank, I would think 200 above should be 208, 120 should be 128, and 40 should be 48.

@scottwittenburg
Collaborator Author

Hmmmm, I wonder: if rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get overwritten.

@scottwittenburg
Collaborator Author

If rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get overwritten.

That was the problem; the hint from running od was the clue that eventually led to the solution. In 13d7f06, I have patched it so that the problem case now passes the test and produces the expected data!
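
For anyone who hits the same symptom, a minimal illustration of the failure mode using POSIX open flags (an assumption for illustration; the ADIOS2 transport layer has its own open-mode handling): a truncating open wipes what another rank already wrote to the shared subfile, while an update-style open preserves it.

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int OpenSubfileForSharedWriting(const char *path, bool alreadyWrittenThisRun)
    {
        if (alreadyWrittenThisRun)
        {
            // Keep existing contents; the caller positions writes (e.g. via pwrite)
            // at the offset handed down the aggregation chain.
            return open(path, O_WRONLY | O_CREAT, 0666);
        }
        // The first writer of the run may truncate safely.
        return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    }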

@eisenhauer
Member

Great! Yeah, you may be running into filesystem oddities with this approach. Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

@scottwittenburg
Collaborator Author

But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?

Also that clue is huge 😆

Your help was invaluable in finding this bug I wrote, thank you @eisenhauer!

I still have 2 things to handle which you pointed out in previous comments:

  1. two-level metadata aggregation: either metadata swizzling or else separate comm for metadata chains
  2. guarding against Bad Things when doing multiple Put() within a timestep

@scottwittenburg
Collaborator Author

Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

Currently each rank has a separate transport manager for every file it ends up writing to. I'm keeping each of those files open throughout the run, to avoid unnecessary Open() in case a rank ever switches back to a file it has already written to previously.

Does that seem like a bad idea? Maybe I need to seek to some location in that case? Or does WriteFileAt(DataVec.data(), DataVec.size(), m_StartDataPos) handle writing somewhere away from the current file pointer?

@eisenhauer
Member

2. guarding against Bad Things when doing multiple Put() within a timestep

Just to clarify this one, multiple Put() shouldn't change anything for you. You still get a single contiguous data block. It's calls to PerformDataWrite() that would potentially mess you up; those would result in multiple calls to WriteData() in a single timestep.
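
For concreteness, a sketch of the application-side sequence being described (assuming the adios2 C++ API; shown only to make the caveat concrete), where each PerformDataWrite() plus the EndStep() leads to a WriteData() call on the writer side:

    #include <adios2.h>
    #include <vector>

    void WriteWithMidStepFlushes(adios2::Engine &writer, adios2::Variable<double> &var,
                                 const std::vector<double> &block)
    {
        writer.BeginStep();
        writer.Put(var, block.data());
        writer.PerformDataWrite(); // data (not metadata) flushed mid-step -> WriteData()
        writer.Put(var, block.data());
        writer.PerformDataWrite(); // second mid-step flush -> WriteData() again
        writer.Put(var, block.data());
        writer.EndStep();          // final WriteData() plus the step's metadata
    }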

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 62a70c4 to 9a1ca59 on July 3, 2025 18:54