
Data size based aggregation #4548


Draft
wants to merge 48 commits into base: master

Conversation

scottwittenburg
Collaborator

Goal: Add a new aggregation approach on the writer side, where ranks are grouped into MPIChains of roughly equal data size, for writing.

  • Add a new helper module to handle partitioning of ranks based on data size. It should be easy to add new partitioning strategies, and eventually we should allow selecting one at runtime via config options (one possible strategy is sketched after this list).
  • Add an MPIChain constructor that will put ranks into substreams based on a computed data-size based partitioning of the ranks
  • Update BP5Writer to collect some member fields into a new structure containing the things that should be cached. The purpose is to avoid re-opening the same files on subsequent time steps, when that can be avoided.
  • Update BP5Writer initialization logic to support data-size based aggregation (which is off by default and enabled with a configuration setting).
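
As a concrete example of the kind of strategy the helper module could provide (a hypothetical sketch, not the code in this PR), a greedy assignment like the following happens to reproduce the partitions printed in the test logs later in this thread, e.g. sizes [48, 80, 128, 160, 208] into 3 substreams give [4], [3 0], [2 1]:

    // Hypothetical sketch: sort ranks by descending data size and greedily
    // assign each rank to the currently lightest substream.
    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <vector>

    std::vector<std::vector<int>> GreedyPartition(const std::vector<uint64_t> &rankSizes,
                                                  size_t numSubstreams)
    {
        // Visit ranks from largest to smallest data size.
        std::vector<int> order(rankSizes.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return rankSizes[a] > rankSizes[b]; });

        std::vector<std::vector<int>> partitions(numSubstreams);
        std::vector<uint64_t> partitionSizes(numSubstreams, 0);
        for (int rank : order)
        {
            // Place the rank in the substream with the smallest accumulated size so far.
            size_t lightest =
                std::min_element(partitionSizes.begin(), partitionSizes.end()) -
                partitionSizes.begin();
            partitions[lightest].push_back(rank);
            partitionSizes[lightest] += rankSizes[rank];
        }
        return partitions;
    }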

@scottwittenburg
Collaborator Author

fyi @pnorbert and @eisenhauer

This is my strawman for data-size based aggregation. Over the next couple of days I plan to comment in various locations in the changeset to ask questions about the direction this approach is taking, with the goal of getting your feedback.

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 758fb60 to 1956453 on June 13, 2025 00:55
useProfiler, *DataWritingComm);
}

// If we got a cache hit above, we'd like to skip the enclosed OpenFiles
Collaborator Author

This is one of the places where I tried to avoid opening files in the case of a cache hit, but it results in deadlock.

@pnorbert @eisenhauer

@@ -121,6 +125,12 @@ bp_gtest_add_tests_helper(WriteReadAttributesMultirank MPI_ALLOW)
bp_gtest_add_tests_helper(LargeMetadata MPI_ALLOW)
bp5_gtest_add_tests_helper(WriteStatsOnly MPI_ALLOW)

if (ADIOS2_HAVE_MPI)
gtest_add_tests_helper(DataSizeAggregate MPI_ONLY BP Engine.BP. .BP5.DSB
WORKING_DIRECTORY ${BP5_DIR} EXTRA_ARGS "DataSizeBased" 2 1
Collaborator Author

@scottwittenburg scottwittenburg Jun 18, 2025

The extra args passed here are the aggregation type, the desired number of subfiles, and the number of time steps to iterate. Run with any number of subfiles and a single time step, the test passes. Running with 2 timesteps results in failure when trying to read the data back in.

2 steps failure details
$ /opt/mpi/mpich/4.1.2/install/bin/mpiexec "-n" "4" "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/build/bin/Test.Engine.BP.DataSizeAggregate.MPI" "--gtest_filter=DSATest.TestWriteUnbalancedData" "DataSizeBased" "2" "2"
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN      ] DSATest.TestWriteUnbalancedData
Number of timesteps: 2
Aggregation type: DataSizeBased
Number of subfiles: 2
Rank data sizes (in number of columns)
1 2 3 4 
 --- Constructing BP5Writer (0)--- 
 --- Constructing BP5Writer (2)--- 
 --- Constructing BP5Writer (3)--- 
 BP5Writer::0::BeginStep(StepMode::Append, -1) 
 BP5Writer::2::BeginStep(StepMode::Append, -1) 
 BP5Writer::3::BeginStep(StepMode::Append, -1) 
 --- Constructing BP5Writer (1)--- 
 BP5Writer::0::EndStep() 
 BP5Writer::2::EndStep() 
 BP5Writer::0::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::2::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::1::BeginStep(StepMode::Append, -1) 
 BP5Writer::3::EndStep() 
 BP5Writer::1::EndStep() 
 BP5Writer::3::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::1::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
Rank data sizes: [32, 64, 96, 128Paritioning resulted in 2 substreams:
  0: [3 0 ], partition data size: 160
  1: [2 1 ], partition data size: 160
Rank 0 - A
Rank 0 - B
Rank Rank 1 - A
Rank 1 - B
Rank 1 - D
Rank 1 - F
Rank 2 - A
Rank 2 - B
Rank 2 - D
Rank 2 - F
Rank 3 - A
Rank 3 - B
Rank 3 - D
Rank 3 - F
0 - D
Rank 0 - E
Rank 0 - F
Rank 0 - G
Rank 0 - I
Rank 1 - G
Rank 1 - I
Rank 1 - J
Rank Rank 2 - G
Rank 2 - I
Rank 2 - J
Rank 2 - K
Rank 2 - L
Rank 3 - G
Rank 3 - I
Rank 3 - J
Rank 3 - K
Rank 3 - L
Rank 0 - J
Rank 0 - K
Rank 0 - L
1 - K
Rank 1 - L
Rank 0 - M
Rank 0 - N
Rank 0 - O
Rank 1 - M
Rank 1 - N
Rank 1 InitBPBuffer()
Rank 1 new writer needed, generate now, write later
Rank 2 - M
Rank 2 - N
Rank 2 InitBPBuffer()
Rank 3 - M
Rank 3 - N
Rank 3 InitBPBuffer()
Rank 3 new writer needed, generate now, write later
Rank 2 new writer needed, generate now, write later
Rank 0 InitBPBuffer()
Rank 0 new writer needed, generate now, write later
 BP5Writer::1::BeginStep(StepMode::Append, -1) 
 BP5Writer::2::BeginStep(StepMode::Append, -1) 
 BP5Writer::3::BeginStep(StepMode::Append, -1) 
 BP5Writer::0::WriteMetaMetadata() 
 BP5Writer::0::WriteMetadata() OVERLOAD2
 BP5Writer::1::EndStep() 
 BP5Writer::2::EndStep() 
 BP5Writer::0::WriteMetadataFileIndex() 
 BP5Writer::1::WriteData() 
 BP5Writer::2::WriteData() 
 BP5Writer::3::EndStep() 
 BP5Writer::3::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
 BP5Writer::0::BeginStep(StepMode::Append, -1) 
 BP5Writer::0::EndStep() 
 BP5Writer::0::WriteData() 
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
Rank data sizes: [128, 64, 96, 32Paritioning resulted in 2 substreams:
  0: [0 3 ], partition data size: 160
  1: [2 1 ], partition data size: 160
Rank 1 cache hit for aggregator key 1-2-1-0
Rank 2 cache hit for aggregator key 1-2-0-1
Rank 2 - A
Rank 2 - B
Rank 2 - D
Rank 2 - F
Rank 0 - A
Rank 0 - B
Rank 0 - D
Rank 0 - E
Rank 1 - A
Rank 1 - B
Rank 1 - D
Rank 1 - F
Rank 3 - A
Rank 3 - B
Rank 3 - D
Rank 3 - F
Rank 0 - F
Rank 0 - G
Rank 0 - I
Rank 2 - G
Rank 2 - I
Rank 2 - J
Rank 2 - K
Rank 2 - L
Rank 3 - G
Rank 3 - I
Rank 3 - J
Rank 3 - K
Rank 3 - L
Rank 0 - J
Rank 0 - K
Rank 0 - L
Rank 1 - G
Rank 1 - I
Rank 1 - J
Rank 1 - K
Rank 1 - L
Rank 0 - M
Rank 0 - N
Rank 0 - O
Rank 3 - M
Rank 3 - N
Rank 3 InitBPBuffer()
Rank 3 new writer needed, generate now, write later
Rank 1 - M
Rank 1 - N
Rank 1 InitBPBuffer()
Rank 1 new writer needed, generate now, write later
Rank 2 - M
Rank 2 - N
Rank 2 InitBPBuffer()
Rank 2 new writer needed, generate now, write later
Rank 0 InitBPBuffer()
Rank 0 new writer needed, generate now, write later
 BP5Writer::0::WriteMetaMetadata() 
 BP5Writer::1::DoClose() 
 BP5Writer::2::DoClose() 
 BP5Writer::0::WriteMetadata() OVERLOAD2
 BP5Writer::3::DoClose() 
 BP5Writer::0::WriteMetadataFileIndex() 
 BP5Writer::0::DoClose() 
Finished writing unbalanced data, reading it back
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.

[  FAILED  ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST
[----------] 1 test from DSATest (11 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST
1 test, listed below:
[  FAILED  ] DSATest.TestWriteUnbalancedData

 1 FAILED TEST

You might notice in that case that I ran with 4 processes, and that the greedy partitioning strategy, together with my naive choice of cache key, caused two ranks to get a cache hit on the 2nd time step while the other two got cache misses. All ranks write to the same file for t=2 as they did for t=1, but because the partitioning strategy re-ordered the ranks within one of the substreams, and because the cache key includes the substream rank order, those two ranks think they need to open new files.
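
Purely to illustrate that last point (the real key format in this branch may differ), here is the difference between a key that encodes the chain order and one that does not:

    #include <algorithm>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hypothetical keys for an aggregator whose substream writes subfile `subfileIdx`
    // and contains `chainRanks` in chain order.
    std::string OrderSensitiveKey(int subfileIdx, const std::vector<int> &chainRanks)
    {
        std::ostringstream key;
        key << subfileIdx;
        for (int r : chainRanks)
            key << "-" << r; // [3, 0] and [0, 3] give different keys for the same file
        return key.str();
    }

    std::string OrderInsensitiveKey(int subfileIdx, std::vector<int> chainRanks)
    {
        std::sort(chainRanks.begin(), chainRanks.end()); // chain order no longer matters
        std::ostringstream key;
        key << subfileIdx;
        for (int r : chainRanks)
            key << "-" << r;
        return key.str();
    }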

Also note that the only reason there is no deadlock when running for 2 timesteps is that, currently, the InitTransports() method does all the normal init steps regardless of whether there was a cache hit, specifically to avoid deadlocks.
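
The deadlock shape, roughly (an illustrative plain-MPI sketch, assuming the guarded init step involves communication among the chain members):

    #include <mpi.h>

    void OpenSubfileCollectively(MPI_Comm chainComm); // stand-in for the guarded init step

    void MaybeOpen(MPI_Comm chainComm, bool cacheHit)
    {
        // BROKEN: ranks with a cache hit return early; any rank that still enters the
        // communicating open waits forever for partners that already left.
        //   if (cacheHit) return;
        //   OpenSubfileCollectively(chainComm);

        // SAFE: agree across the chain first, so either everyone participates or no one does.
        int miss = cacheHit ? 0 : 1;
        int anyMiss = 0;
        MPI_Allreduce(&miss, &anyMiss, 1, MPI_INT, MPI_LOR, chainComm);
        if (anyMiss)
        {
            OpenSubfileCollectively(chainComm);
        }
    }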

Collaborator Author

Looking at the bp file produced when running with 2 timesteps, bpls reports the same error as the test code:

bpls output
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -s "0,0" -c "-1,-1" ./unbalanced_output.bp
Failed to open with BPFile engine: [Tue Jun 17 17:31:35 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version


Error: Could not open this file with any ADIOS2 file reading engines

Collaborator Author

@scottwittenburg scottwittenburg Jun 18, 2025

And bp5dbg has an even harder time with the 2-timestep output:

bp5dbg output
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp
========================================================
    Index Table File: ./unbalanced_output.bp/md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
|                                  |    |    |    |      | yes  |     0    |   0    |  no  |   yes     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [242514657280]
Unknown record , lenght = [242514657280]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [67108864]
Unknown record , lenght = [67108864]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [512]
Unknown record , lenght = [512]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [72057594037927936]
Unknown record , lenght = [72057594037927936]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [15889042532990976]
Unknown record , lenght = [15889042532990976]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [1099511627776000]
Unknown record , lenght = [1099511627776000]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [4294967296000]
Unknown record , lenght = [4294967296000]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [536870912]
Unknown record , lenght = [536870912]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [3145728]
Unknown record , lenght = [3145728]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [8192]
Unknown record , lenght = [8192]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg", line 155, in <module>
    status = DumpIndexTableFile(args)
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg", line 93, in DumpIndexTableFile
    status, MetadataIndexTable, WriterMap = DumpIndexTable(
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages/adios2/bp5dbg/idxtable.py", line 156, in DumpIndexTable
    status, MetadataIndexTable, WriterMap = ReadIndex(
  File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages/adios2/bp5dbg/idxtable.py", line 64, in ReadIndex
    reclen = np.frombuffer(table, dtype=np.uint64, count=1,
ValueError: buffer is smaller than requested size

Collaborator Author

And finally, note that I think the test logic itself is more or less sound for 2 timesteps, because if you provide EveryoneWritesSerial as the first argument, the test passes, and bpls / bp5dbg produce reasonable-looking (to me) output.

Collaborator Author

It appears to me, from the output of the test / bpls / bp5dbg, that the error is coming from here, and thus is possibly caused by some problem in the metadata index file?

Member

And to make sure I understand, you would only expect this problem to rear its head when we exceed OneLevelGatherRanksLimit?

Yes, I believe that that's true...

Collaborator Author

@scottwittenburg scottwittenburg Jun 18, 2025

I'm mulling those first two options. Upon first consideration, option 1 seems more attractive since it seems easier to implement (but maybe I just haven't looked far enough along that path yet).

Yes, I believe that that's true...

And when it happens, how do you expect it would manifest itself? I'm just looking for a rough idea here, like would it segfault when writing, or just produce data that can't be read back in correctly?

Member

Maybe a segfault when writing, but definitely bad data on read (unless you got lucky). That "per rank metadata size" array is gathered in a one-level gather, so the size values will be in a canonical order. It's then used to create the per-rank size values in the subsequent gathervs, be they one or two level. If we end up associating a size with the wrong rank we'll gather trash data, either too little from one rank or too much from another. The latter might be a segfault if it was unmapped memory, but even if it didn't blow up on the writer, we'd be calculating the wrong start positions for metadata on the reader and blow up when trying to parse. (Actually, you'd likely get an FFS error about unknown formats because it wouldn't find the right header when starting the parse...)
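
A plain-MPI sketch of the two-phase pattern being described (illustrative only; ADIOS2 uses its own helper::Comm wrappers): the per-rank sizes arrive from a one-level gather in canonical rank order and then become the counts and displacements of the subsequent gatherv, so associating a size with the wrong rank corrupts every offset after it.

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    std::vector<char> GatherMetadata(MPI_Comm comm, const std::vector<char> &localMD)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        // Phase 1: one-level gather of per-rank metadata sizes, canonical rank order.
        uint64_t mySize = localMD.size();
        std::vector<uint64_t> sizes(size);
        MPI_Gather(&mySize, 1, MPI_UINT64_T, sizes.data(), 1, MPI_UINT64_T, 0, comm);

        // Phase 2: counts and displacements derived from those sizes, in the same order.
        // If sizes[i] does not really belong to rank i, every displacement after i is wrong
        // and the gathered metadata is garbage.
        std::vector<int> counts(size), displs(size);
        std::vector<char> all;
        if (rank == 0)
        {
            int offset = 0;
            for (int i = 0; i < size; ++i)
            {
                counts[i] = static_cast<int>(sizes[i]);
                displs[i] = offset;
                offset += counts[i];
            }
            all.resize(offset);
        }
        MPI_Gatherv(localMD.data(), static_cast<int>(localMD.size()), MPI_CHAR, all.data(),
                    counts.data(), displs.data(), MPI_CHAR, 0, comm);
        return all; // meaningful only on rank 0
    }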

Collaborator Author

I'm going back to consider the two-level metadata aggregation, and want to get more clarification on the problem in general, and one of those approaches you mentioned specifically.

Generally, I'm not sure I see in the code how the order of the initial data aggregator/mpichains has anything to do with the metadata aggregator/mpichains. They are currently separate aggregators, and as far as I can see, their only connection to each other is that they both have MPI_COMM_WORLD (or whatever communicator is passed to the writer's constructor) as their parent communicator.

And regarding option 1 specifically, you said:

First, we could separate the aggregation communicators used for data from that used for metadata. There is no logical relationship between those operations, so by sticking with the original plan for metadata communication we avoid all these issues. (And I don't think there's overhead associated with having multiple subcommunicators.)

When you say "original plan" here, do you mean how metadata was aggregated before this PR? Or is it rather something we discussed in a meeting? So far in this PR I think I have tried not to touch how metadata is aggregated, so is it possible this is already a non-issue?
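
For reference, a hypothetical sketch (plain MPI, not the ADIOS2 helper::Comm API) of what keeping the data and metadata aggregation communicators disjoint could look like, so that re-partitioning the data chains each step never perturbs the metadata chains:

    #include <mpi.h>

    // Assumes metadataAggregators >= 1; names are illustrative, not the PR's code.
    void SplitCommunicators(MPI_Comm world, int dataSubstreamForMyRank, int metadataAggregators,
                            MPI_Comm *dataChainComm, MPI_Comm *metadataChainComm)
    {
        int rank;
        MPI_Comm_rank(world, &rank);

        // Data chains: colored by the substream chosen by the size-based partition;
        // this split may be redone every step.
        MPI_Comm_split(world, dataSubstreamForMyRank, rank, dataChainComm);

        // Metadata chains: colored by a fixed, rank-derived assignment, independent of
        // the data partition, so it never changes between steps.
        MPI_Comm_split(world, rank % metadataAggregators, rank, metadataChainComm);
    }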

Member

If those aggregators are disjoint, then it's a non-issue. I was under the impression that we were using the same aggregators for both data and metadata aggregation when metadata aggregation was multi-level. If we're not, then no problems.

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from c9de252 to 4408611 on June 18, 2025 01:56
@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 18, 2025

With my current implementation, I have found a broken case when 3 of 5 ranks get an aggregator cache hit, while 2 of 5 do not.

I'll try to describe the problem by first showing the expected outputs when the aggregation type is EveryoneWritesSerial, and then comparing those to the outputs when the aggregation type is DataSizeBased.

In both cases I run with 5 mpi processes, 3 subfiles, and 2 timesteps.

First look at the data produced when using EveryoneWritesSerial, and note that the test passes in the process of producing the data.

EveryoneWritesSerial - bpls
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -D ./unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
    (0,0)    0 15 30 45 60 
          block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62 
          block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65 
          block 3: [0:4,  6: 9]
    (0,0)    6 7 8 9 21 22
    (1,2)    23 24 36 37 38 39
    (3,0)    51 52 53 54 66 67
    (4,2)    68 69 
          block 4: [0:4, 10:14]
    (0,0)    10 11 12 13 14 25
    (1,1)    26 27 28 29 40 41
    (2,2)    42 43 44 55 56 57
    (3,3)    58 59 70 71 72 73
    (4,4)    74 
        step 1: 
          block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137
    (4,4)    138 
          block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140 
          block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143 
          block 3: [0:4, 10:13]
    (0,0)    84 85 86 87 99 100
    (1,2)    101 102 114 115 116 117
    (3,0)    129 130 131 132 144 145
    (4,2)    146 147 
          block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148
EveryoneWritesSerial - bp5dbg
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp/
========================================================
    Index Table File: ./unbalanced_output.bp//md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| ADIOS-BP v2.:.0 Index Table      |  2  |  :  |  0  |      | yes  |     5    |   2    |  no  |    no     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 3  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        0 |
  |      1 |        0 |
  |      2 |        1 |
  |      3 |        1 |
  |      4 |        2 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 0     | MetadataPos = 0          |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:0
 Writer 1 data loc:4096
 Writer 2 data loc:0
 Writer 3 data loc:4096
 Writer 4 data loc:0
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 1     | MetadataPos = 1248       |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:8192
 Writer 1 data loc:12288
 Writer 2 data loc:8192
 Writer 3 data loc:12288
 Writer 4 data loc:4096
--------------------------------------------------------------------------------------------------
========================================================
    MetaMetadata File: ./unbalanced_output.bp/mmd.0
========================================================
 File size = 628
 --------------------------------------------------
 |  Record |  Offset  |  ID length |  Info length |
 --------------------------------------------------
 |       0 |        0 |         12 |          600 |
 --------------------------------------------------
========================================================
    Metadata File: ./unbalanced_output.bp/md.0
========================================================
Step 0: 
  Offset = 0
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 80
    Writer 1: md size 232 offset 312
    Writer 2: md size 232 offset 544
    Writer 3: md size 232 offset 776
    Writer 4: md size 232 offset 1008
  Attribute metadata entries: 
    Writer 0: md size 0 offset 1240
    Writer 1: md size 0 offset 1240
    Writer 2: md size 0 offset 1240
    Writer 3: md size 0 offset 1240
    Writer 4: md size 0 offset 1240
========================================================
Step 1: 
  Offset = 1248
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 1328
    Writer 1: md size 232 offset 1560
    Writer 2: md size 232 offset 1792
    Writer 3: md size 232 offset 2024
    Writer 4: md size 232 offset 2256
  Attribute metadata entries: 
    Writer 0: md size 0 offset 2488
    Writer 1: md size 0 offset 2488
    Writer 2: md size 0 offset 2488
    Writer 3: md size 0 offset 2488
    Writer 4: md size 0 offset 2488

Now look at the data produced when using DataSizeBased (which is associated with test failure).

DataSizeBased - bpls
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -D ./unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
    (0,0)    0 0 0 0 0 
          block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62 
          block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65 
          block 3: [0:4,  6: 9]
    (0,0)    0 0 0 0 0 0
    (1,2)    0 0 0 0 0 0
    (3,0)    0 0 0 0 0 0
    (4,2)    0 0 
          block 4: [0:4, 10:14]
    (0,0)    0 0 0 0 0 0
    (1,1)    0 0 0 0 0 0
    (2,2)    0 0 0 0 0 0
    (3,3)    0 0 0 0 0 0
    (4,4)    0 
        step 1: 
          block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137
    (4,4)    138 
          block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140 
          block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143 
          block 3: [0:4, 10:13]
    (0,0)    0 0 0 0 0 0
    (1,2)    0 0 0 0 0 0
    (3,0)    0 0 0 0 0 0
    (4,2)    0 0 
          block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148
DataSizeBased - bp5dbg
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp/
========================================================
    Index Table File: ./unbalanced_output.bp//md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
|        Version string            |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
|          32 bytes                |  1B |  1B |  1B |  1B  |  1B  |    1B    |   1B   |  1B  |    1B     |  23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| ADIOS-BP v2.:.0 Index Table      |  2  |  :  |  0  |      | yes  |     5    |   2    |  no  |    no     |      |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 0  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        1 |
  |      1 |        2 |
  |      2 |        2 |
  |      3 |        1 |
  |      4 |        0 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 0     | MetadataPos = 0          |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:4096
 Writer 1 data loc:4096
 Writer 2 data loc:0
 Writer 3 data loc:0
 Writer 4 data loc:0
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 'w', length = [64]
  WriterMap: Writers = 5  Aggregators = 0  Subfiles = 3
  =====================
  |  Rank  |  Subfile |
  ---------------------
  |      0 |        0 |
  |      1 |        2 |
  |      2 |        2 |
  |      3 |        1 |
  |      4 |        1 |
  =====================
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record 's', length = [64]
|   Step = 1     | MetadataPos = 1248       |  MetadataSize = 1248         | FlushCount = 0  |
 Writer 0 data loc:8192
 Writer 1 data loc:12288
 Writer 2 data loc:8192
 Writer 3 data loc:8192
 Writer 4 data loc:12288
--------------------------------------------------------------------------------------------------
========================================================
    MetaMetadata File: ./unbalanced_output.bp/mmd.0
========================================================
 File size = 628
 --------------------------------------------------
 |  Record |  Offset  |  ID length |  Info length |
 --------------------------------------------------
 |       0 |        0 |         12 |          600 |
 --------------------------------------------------
========================================================
    Metadata File: ./unbalanced_output.bp/md.0
========================================================
Step 0: 
  Offset = 0
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 80
    Writer 1: md size 232 offset 312
    Writer 2: md size 232 offset 544
    Writer 3: md size 232 offset 776
    Writer 4: md size 232 offset 1008
  Attribute metadata entries: 
    Writer 0: md size 0 offset 1240
    Writer 1: md size 0 offset 1240
    Writer 2: md size 0 offset 1240
    Writer 3: md size 0 offset 1240
    Writer 4: md size 0 offset 1240
========================================================
Step 1: 
  Offset = 1248
  Size = 1240
  Variable metadata entries: 
    Writer 0: md size 232 offset 1328
    Writer 1: md size 232 offset 1560
    Writer 2: md size 232 offset 1792
    Writer 3: md size 232 offset 2024
    Writer 4: md size 232 offset 2256
  Attribute metadata entries: 
    Writer 0: md size 0 offset 2488
    Writer 1: md size 0 offset 2488
    Writer 2: md size 0 offset 2488
    Writer 3: md size 0 offset 2488
    Writer 4: md size 0 offset 2488

Note when reading the output above that the size-based partitioning produced at t=1 was [[4],[3,0],[2,1]], while at t=2 the partitioning was [[0],[3,4],[2,1]]. Considering that information, there's at least one noticeable problem: the Writer 0 data loc: 8192 for the second timestep seems like it should actually be Writer 0 data loc: 4096, since only one rank wrote to file 0 on the previous time step.

Any idea where I should be looking to correct that problem? Seems like it's in the metadata index file, but beyond that, I'm not sure. Also, it appears to me that the other writer data locations for timestep 2 are correct, which is confusing, unless I'm interpreting the data wrong.

One last note: when repeating the above tests with only one time step, the test passes in both cases, and the data looks fine. To me that seems to indicate the problem is related to repeating some of the initialization in InitTransports() and InitBPBuffer() more than once.

@eisenhauer
Member

Hmm. A quick question: are you using SelectiveAggregationMetadata? I.e., is the UseSelectiveMetadataAggregation engine parameter at its default "true" value, and is the first branch of this if in BP5Writer being taken?

    if (m_Parameters.UseSelectiveMetadataAggregation)
    {
        SelectiveAggregationMetadata(TSInfo);
    }
    else
    {
        TwoLevelAggregationMetadata(TSInfo);
    }

Because if not, if you're falling into TwoLevelAggregationMetadata, then all the unfortunate effects that I might have anticipated for >6K nodes are hitting you now...

@scottwittenburg
Collaborator Author

A quick question. Are you using SelectiveAggregationMetadata?

Ah yes, great question, I assume you asked because you noticed the re-ordering between timesteps. I just confirmed that in this case, it's taking the SelectiveAggregationMetadata path.

@scottwittenburg
Collaborator Author

The aggregator rank in each MPI chain receives the final data position from the last rank in the chain. Is that because there is an assumption that the aggregator rank is going to be writing to the same file the next time around?
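
For context, here is roughly how that hand-off could work in a write-serial chain (an illustrative plain-MPI sketch, not the actual aggregator code): each rank gets its start offset from its predecessor, and the last rank reports the final end-of-data position back to the aggregator, which only pays off on the next step if the aggregator keeps writing the same file.

    #include <mpi.h>
    #include <cstdint>

    // Ranks in chainComm share one subfile; rank 0 of the chain is the aggregator.
    uint64_t AdvanceChainOffset(MPI_Comm chainComm, uint64_t startOfStep, uint64_t myDataSize)
    {
        int rank, size;
        MPI_Comm_rank(chainComm, &rank);
        MPI_Comm_size(chainComm, &size);

        uint64_t myStart = startOfStep;
        if (rank > 0)
        {
            // Wait for the predecessor to say where its data ended.
            MPI_Recv(&myStart, 1, MPI_UINT64_T, rank - 1, 0, chainComm, MPI_STATUS_IGNORE);
        }

        uint64_t myEnd = myStart + myDataSize; // this rank writes [myStart, myEnd)
        if (rank < size - 1)
        {
            // Pass the running offset to the next rank in the chain.
            MPI_Send(&myEnd, 1, MPI_UINT64_T, rank + 1, 0, chainComm);
        }
        else if (size > 1)
        {
            // Last rank reports the final position back to the aggregator.
            MPI_Send(&myEnd, 1, MPI_UINT64_T, 0, 0, chainComm);
        }

        uint64_t finalPos = myEnd;
        if (rank == 0 && size > 1)
        {
            MPI_Recv(&finalPos, 1, MPI_UINT64_T, size - 1, 0, chainComm, MPI_STATUS_IGNORE);
        }
        return finalPos; // meaningful on the aggregator (rank 0)
    }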

@eisenhauer
Member

Just wanted to make sure, because the TwoLevel stuff uses those aggregators, and if they're changing, potentially bad things. Also, as a debugging aid, make sure this #ifdef DUMPDATALOCINFO evaluates to true so that you get debugging output from writer rank 0 without having to rely on bp5dbg. I'm assuming that you're not doing any data flushes, so what happens in FlushData is irrelevant. So you'll be tracking the m_StartDataPos that comes into SelectiveAggregationMetadata. Whatever happens in WriteData, that should accurately reflect where that rank's data starts in whatever file it lands in. If it doesn't, nothing else will go well after that.

@eisenhauer
Member

Something else to note while I'm thinking about it. You'll notice that WriteData() is also being called in FlushData(). FlushData is a user-visible BP5 call that allows the application to write partial timestep data in the middle of a timestep (i.e. before EndStep()). This is useful if you are pushing memory limits and don't have room to buffer. So, for example, you could do BeginStep(), Put(), Put(), FlushData(), Put(), Put(), FlushData(), Put(), EndStep(). Each of those flushes and the end step results in data (but not metadata) going to disk.

The FlushPosSizeInfo vector on rank 0 accumulates all the info from each rank for where its data landed and how much data there was on each flush. Note that, as a natural consequence of writing piece-wise, the data from rank X is not contiguous in the data file, but broken up into multiple segments. In order to find the data at Offset M, timestep T, from rank N, we have to have the offset and size of each chunk, so this all ends up in a single metadata index record for that timestep.

The relevant bit for you is that WriterMapRecords only occur between timesteps, which means that we don't have the ability to let a rank change what file it is writing to in the middle of a timestep. So the first time WriteData() is called in a timestep (which might be in a FlushData or in EndStep), you are free to change the WriterSubfileMap, but you should not change it again in that timestep. You're probably not playing with those examples yet, but this is something that will require a little bit of extra logic to make sure it doesn't happen.
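
A minimal sketch of the extra guard that last point suggests (the type and member names here are hypothetical, not existing BP5Writer fields): recompute the partition only on the first WriteData() of a step and reuse it for any further flushes in that step.

    // Hypothetical guard, for illustration only.
    struct DataSizeAggregationState
    {
        bool m_WriterMapChosenThisStep = false;

        void BeginStep() { m_WriterMapChosenThisStep = false; }

        void OnWriteData(/* BP5Writer &writer */)
        {
            if (!m_WriterMapChosenThisStep)
            {
                // First WriteData of the step (from a mid-step flush or EndStep):
                // recompute the size-based partition and, if it changed, emit a
                // WriterMap record ahead of the Step record.
                // RepartitionAndRebuildChains(writer);
                m_WriterMapChosenThisStep = true;
            }
            // Later WriteData calls in the same step keep the existing chains, so all
            // flushes of the step land in the same subfile.
        }
    };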

// Stuff we only want to do on the first timestep
m_SubstreamDataPos.resize(m_Parameters.NumSubFiles);
}
m_CommAggregators =
Collaborator Author

It appeared to me that m_CommAggregators was split off from the world, and then never used anywhere, so I decided to re-purpose it. Hopefully that's acceptable and I didn't misread the code.

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 25, 2025

By the way, on the master branch, when I #define DUMPDATALOCINFO, this is what it outputs for the first time step:

Extra debug output:
315: Flush count is :0
315: Write Index positions = {
315: Writer 0 has data at: 
315: loc:O
315: Writer 1 has data at: 
315: loc:S
315: Writer 2 has data at: 
315: loc:-
315: Writer 3 has data at: 
315: loc:B
315: Writer 4 has data at: 
315: loc:P
315: Writer 5 has data at: 
315: loc: 
315: Writer 6 has data at: 
315: loc:v
315: Writer 7 has data at: 
315: loc:2
315: Writer 8 has data at: 
315: loc:.
315: Writer 9 has data at: 
315: loc::
315: Writer 10 has data at: 
315: loc:.
315: Writer 11 has data at: 
315: loc:0
315: Writer 12 has data at: 
315: loc: 
315: Writer 13 has data at: 
315: loc:I
315: Writer 14 has data at: 
315: loc:n
315: Writer 15 has data at: 
315: loc:d
315: Writer 16 has data at: 
315: loc:e
315: Writer 17 has data at: 
315: loc:x
315: Writer 18 has data at: 
315: loc: 
315: Writer 19 has data at: 
315: loc:T
315: Writer 20 has data at: 
315: loc:a
315: Writer 21 has data at: 
315: loc:b
315: Writer 22 has data at: 
315: loc:l
315: Writer 23 has data at: 
315: loc:e

For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

@eisenhauer
Copy link
Member

For subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

Ah, yes, that's probably the case. Those items are size_t and maybe the type of buf changed. Which may mean those index calcs into it are off too. I can't look tonight, but will check tomorrow.

@eisenhauer
Copy link
Member

OK, try substituting this code into the matching bits of WriteMetadataFileIndex():

    // Step record
    record = StepRecord;
#ifndef DUMPDATALOCINFO
    size_t StepRecordStartPos = pos;
#endif
    helper::CopyToBuffer(buf, pos, &record, 1); // record type
    d = (3 + ((FlushPosSizeInfo.size() * 2) + 1) * m_Comm.Size()) * sizeof(uint64_t);
    helper::CopyToBuffer(buf, pos, &d, 1); // record length
    helper::CopyToBuffer(buf, pos, &MetaDataPos, 1);
    helper::CopyToBuffer(buf, pos, &MetaDataSize, 1);
    d = static_cast<uint64_t>(FlushPosSizeInfo.size());
    helper::CopyToBuffer(buf, pos, &d, 1);

    for (int writer = 0; writer < m_Comm.Size(); writer++)
    {
        for (size_t flushNum = 0; flushNum < FlushPosSizeInfo.size(); flushNum++)
        {
            // add two numbers here
            helper::CopyToBuffer(buf, pos, &FlushPosSizeInfo[flushNum][2 * writer], 2);
        }
        helper::CopyToBuffer(buf, pos, &m_WriterDataPos[writer], 1);
    }

    m_FileMetadataIndexManager.WriteFiles((char *)buf.data(), buf.size());
#ifndef DUMPDATALOCINFO
    std::cout << "WriterMapRecordType is: " << (buf.data() + StepRecordStartPos)[0] << std::endl;
    size_t *BufPtr = (size_t*)(buf.data() + StepRecordStartPos + 1);
    std::cout << "WriterMapRecordLength is: " << *BufPtr++ << std::endl;
    std::cout << "MetadataPos is: " << *BufPtr++ << std::endl;
    std::cout << "MetadataSize is: " << *BufPtr++ << std::endl;
    std::cout << "Flush count is :" << *BufPtr++ << std::endl;
    std::cout << "Write Index positions = {" << std::endl;

    for (size_t i = 0; i < m_Comm.Size(); ++i)
    {
        std::cout << "Writer " << i << " has data at: " << std::endl;
        uint64_t eachWriterSize = FlushPosSizeInfo.size() * 2 + 1;
        for (size_t j = 0; j < FlushPosSizeInfo.size(); ++j)
        {
            std::cout << "loc:" << *BufPtr++;
            std::cout << " siz:" << *BufPtr++ << std::endl;
        }
        std::cout << "loc:" << *BufPtr++ << std::endl;
    }
    std::cout << "}" << std::endl;
#endif
    m_FileMetadataIndexManager.FlushFiles();

(I won't put these changes into a PR now because that would just make merging your code more difficult when it comes to that.)

@scottwittenburg
Collaborator Author

OK, try substituting in this code into the matching bits of WriteMetadataFileIndex()

Thanks Greg, that works and jibes with both the extra debug printing I was doing and with what bp5dbg shows. Having this is nice, though, because I have been printing the source information, and what you have here confirms it is transferred to the buffer correctly.

So I'm just curious whether this summary below 1) contains all the relevant information you would need to judge whether the data was written correctly, and 2) actually looks correct to you based on the writer chains at each time step (it looks correct to me).

Debug output showing writer chains and metadata
Step 0:

    Rank data sizes: [48, 80, 128, 160, 208]
    Paritioning resulted in 3 substreams:
    0: [4 ]
    1: [3 0 ]
    2: [2 1 ]

    WriterMapRecordType is: s
    WriterMapRecordLength is: 64
    MetadataPos is: 0
    MetadataSize is: 1248
    Flush count is :0

    Write Index positions = {
    Writer 0 has data at:
    loc:4096
    Writer 1 has data at:
    loc:4096
    Writer 2 has data at:
    loc:0
    Writer 3 has data at:
    loc:0
    Writer 4 has data at:
    loc:0
    }

Step 1:

    Rank data sizes: [208, 80, 128, 160, 48]
    Paritioning resulted in 3 substreams:
    0: [0 ]
    1: [3 4 ]
    2: [2 1 ]

    WriterMapRecordType is: s
    WriterMapRecordLength is: 64
    MetadataPos is: 1248
    MetadataSize is: 1248
    Flush count is :0

    Write Index positions = {
    Writer 0 has data at:
    loc:4096
    Writer 1 has data at:
    loc:12288
    Writer 2 has data at:
    loc:8192
    Writer 3 has data at:
    loc:8192
    Writer 4 has data at:
    loc:12288
    }

And here is the comparison of what is produced by EveryoneWritesSerial aggregation vs DataSizeBased:

Side-by-side bpls output
       EveryoneWritesSerial          |             DataSizeBased
-----------------------------------  |  -----------------------------------
  uint64_t  GlobalArray  2*{5, 15}   |    uint64_t  GlobalArray  2*{5, 15}
        step 0:                      |          step 0:
          block 0: [0:4,  0: 0]      |            block 0: [0:4,  0: 0]
    (0,0)    0 15 30 45 60           |      (0,0)    0 0 0 0 0
          block 1: [0:4,  1: 2]      |            block 1: [0:4,  1: 2]
    (0,0)    1 2 16 17 31 32         |      (0,0)    1 2 16 17 31 32
    (3,0)    46 47 61 62             |      (3,0)    46 47 61 62
          block 2: [0:4,  3: 5]      |            block 2: [0:4,  3: 5]
    (0,0)    3 4 5 18 19 20          |      (0,0)    3 4 5 18 19 20
    (2,0)    33 34 35 48 49 50       |      (2,0)    33 34 35 48 49 50
    (4,0)    63 64 65                |      (4,0)    63 64 65
          block 3: [0:4,  6: 9]      |            block 3: [0:4,  6: 9]
    (0,0)    6 7 8 9 21 22           |      (0,0)    0 0 0 0 0 0
    (1,2)    23 24 36 37 38 39       |      (1,2)    0 0 0 0 0 0
    (3,0)    51 52 53 54 66 67       |      (3,0)    0 0 0 0 0 0
    (4,2)    68 69                   |      (4,2)    0 0
          block 4: [0:4, 10:14]      |            block 4: [0:4, 10:14]
    (0,0)    10 11 12 13 14 25       |      (0,0)    0 0 0 0 0 0
    (1,1)    26 27 28 29 40 41       |      (1,1)    0 0 0 0 0 0
    (2,2)    42 43 44 55 56 57       |      (2,2)    0 0 0 0 0 0
    (3,3)    58 59 70 71 72 73       |      (3,3)    0 0 0 0 0 0
    (4,4)    74                      |      (4,4)    0
        step 1:                      |          step 1:
          block 0: [0:4,  0: 4]      |            block 0: [0:4,  0: 4]
    (0,0)    74 75 76 77 78 89       |      (0,0)    74 75 76 77 78 89
    (1,1)    90 91 92 93 104 105     |      (1,1)    90 91 92 93 104 105
    (2,2)    106 107 108 119 120 121 |      (2,2)    106 107 108 119 120 121
    (3,3)    122 123 134 135 136 137 |      (3,3)    122 123 134 135 136 137
    (4,4)    138                     |      (4,4)    138
          block 1: [0:4,  5: 6]      |            block 1: [0:4,  5: 6]
    (0,0)    79 80 94 95 109 110     |      (0,0)    79 80 94 95 109 110
    (3,0)    124 125 139 140         |      (3,0)    124 125 139 140
          block 2: [0:4,  7: 9]      |            block 2: [0:4,  7: 9]
    (0,0)    81 82 83 96 97 98       |      (0,0)    81 82 83 96 97 98
    (2,0)    111 112 113 126 127 128 |      (2,0)    111 112 113 126 127 128
    (4,0)    141 142 143             |      (4,0)    141 142 143
          block 3: [0:4, 10:13]      |            block 3: [0:4, 10:13]
    (0,0)    84 85 86 87 99 100      |      (0,0)    0 0 0 0 0 0
    (1,2)    101 102 114 115 116 117 |      (1,2)    0 0 0 0 0 0
    (3,0)    129 130 131 132 144 145 |      (3,0)    0 0 0 0 0 0
    (4,2)    146 147                 |      (4,2)    0 0
          block 4: [0:4, 14:14]      |            block 4: [0:4, 14:14]
    (0,0)    88 103 118 133 148      |      (0,0)    88 103 118 133 148

Let me know if you spot something that looks wrong, or if I haven't included complete details.

@eisenhauer
Member

The metadata file index stuff does look like what I'd expect. However, things obviously aren't right even with just a single timestep, so something is clearly going badly. I'd like to try to reproduce this output, and I'm hopeful that if I step through with the debugger on the reader side I might see what the issue is. Can you tell me what/how you're running on the writer side so I can give that a go?

@scottwittenburg
Collaborator Author

However, things obviously aren't right even with just a single timestep,

Well, if I run the test with just a single timestep, everything looks fine and works. But when I run the test with two timesteps, somehow the first timestep gets messed up retroactively or something.

so something is clearly going badly

😅 That's for sure!

Can you tell me what/how you're running on the writer side so I can give that a go?

That would be so great, thank you! I'm just running variations of one of the new tests I added, which you can run with all the defaults as:

ctest -R Engine.BP.DSATest.TestWriteUnbalancedData.BP5.DSB.MPI --verbose

The --verbose argument to ctest shows the underlying command it runs, and you can just run that if you want to change the number of MPI processes or any of the default arguments (which are: aggregation type, number of subfiles, and number of time steps). This is what I'm doing to produce the broken bp file: 5 ranks, DataSizeBased aggregation, 3 subfiles, 2 timesteps. Here's the command:

mpiexec "-n" "5" \
    "./bin/Test.Engine.BP.DataSizeAggregate.MPI" \
    "--gtest_filter=DSATest.TestWriteUnbalancedData" \
    "DataSizeBased" "3" "2"

Many combinations of parameters don't result in failure or a messed-up output file like the example above does. I suspect it's probably related to how some of the ranks change the subfile they're writing to between steps. You can try with 1 timestep instead of 2, and you should see that it works.

Note: I need to rm -rf ./unbalanced_output.bp from the build tree between runs, or it sometimes causes the test to fail. Maybe because some of the files don't get overwritten? Or running with more ranks and subsequently fewer leaves files in the directory that break things? 🤷

@eisenhauer
Member

eisenhauer commented Jun 26, 2025

OK, I'll give this a go. I've got a couple of calls yet this afternoon, so it won't be right away. Just FYI, in case you get there first, my plan is to look at what is happening in BP5Reader::ReadData(). The BP5Deserializer looks at the Get()s that happen in a timestep and translates those into a ReadList of data that it needs. That list looks like: from WriterRank W, at Timestep T, read Length bytes from Offset S. ReadData()'s responsibility is to translate that into actual data access. One of the first things it does is to translate W and T to a SubfileNum. Looking at what's going wrong there should be informative.

@scottwittenburg
Collaborator Author

Ok thanks for that information, I'll start looking in BP5Reader::ReadData() as well, since I think it will be helpful for me to have at least some familiarity with the read side.

@eisenhauer
Member

output.txt
Attaching a file with output from bpls with some extra diagnostic output. It looks like we're going to the right data file and the right place in the data file to read for timestep 0, writer rank 0.

(base) eisen@Endor build % bin/bpls -D -d unbalanced_output.bp/
  uint64_t  GlobalArray  2*{5, 15}
        step 0: 
          block 0: [0:4,  0: 0]
WritermapIndex for Timestep 0 is 0
Read from Timestep 0 WriterRank 0 Going to Subfilenum 1
Subfilenum 1 has name unbalanced_output.bp/data.1
DataPosition for Timestep 0 WriterRank 0 is 4096
Offset of Data in block is 0 Final offset in subfile 1 is 4096
    (0,0)    0 0 0 0 0 

But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?

@eisenhauer
Member

Whatever is going on, subfile 1 has zeroes up through Offset 12288, and that doesn't look right.

(base) eisen@Endor build % od -A x unbalanced_output.bp/data.1 
0000000    000000  000000  000000  000000  000000  000000  000000  000000
*
0003000    000130  000000  000000  000000  000147  000000  000000  000000
0003010    000166  000000  000000  000000  000205  000000  000000  000000
0003020    000224  000000  000000  000000  000000  000000  000000  000000
0003030

@scottwittenburg
Collaborator Author

scottwittenburg commented Jun 26, 2025

subfile 1 has zeroes up through Offset 12288, and that doesn't look right

Huh, that is interesting. There is initialization stuff that now gets done on every timestep (it used to be done only once), and I'm just selectively guarding against doing some specific things in there. Maybe more bits need to be guarded against running 1) on a timestep after the first, or 2) in the case of an aggregator output file cache hit, like seeking to position 0 in a file, for example.

I wrote a tiny standalone program just to read the file (mainly so I could run it with a single process), and then added some debug printing to the reader. I see a couple weird things off the bat:

single process reader debug output
rank 0 size 1
ReadRequest:   Timestep: 0  WriterRank: 4  StartOffset: 0  ReadLength: 200
ReadData SubFileNum: 0
ReadRequest:   Timestep: 0  WriterRank: 0  StartOffset: 0  ReadLength: 40
ReadData SubFileNum: 1
ReadRequest:   Timestep: 0  WriterRank: 3  StartOffset: 0  ReadLength: 160
ReadData SubFileNum: 1
ReadRequest:   Timestep: 0  WriterRank: 1  StartOffset: 0  ReadLength: 80
ReadData SubFileNum: 2
ReadRequest:   Timestep: 0  WriterRank: 2  StartOffset: 0  ReadLength: 120
ReadData SubFileNum: 2
Step data:
0 1 2 3 4 5 0 0 0 0 0 0 0 0 0 0 16 17 18 19 20 0 0 0 0 0 0 0 0 0 0 31 32 33 34 35 0 0 0 0 0 0 0 0 0 0 46 47 48 49 50 0 0 0 0 0 0 0 0 0 0 61 62 63 64 65 0 0 0 0 0 0 0 0 0 
ReadRequest:   Timestep: 1  WriterRank: 0  StartOffset: 0  ReadLength: 200
ReadData SubFileNum: 0
ReadRequest:   Timestep: 1  WriterRank: 3  StartOffset: 0  ReadLength: 160
ReadData SubFileNum: 1
ReadRequest:   Timestep: 1  WriterRank: 4  StartOffset: 0  ReadLength: 40
ReadData SubFileNum: 1
ReadRequest:   Timestep: 1  WriterRank: 1  StartOffset: 0  ReadLength: 80
ReadData SubFileNum: 2
ReadRequest:   Timestep: 1  WriterRank: 2  StartOffset: 0  ReadLength: 120
ReadData SubFileNum: 2
Step data:
74 75 76 77 78 79 80 81 82 83 0 0 0 0 88 89 90 91 92 93 94 95 96 97 98 0 0 0 0 103 104 105 106 107 108 109 110 111 112 113 0 0 0 0 118 119 120 121 122 123 124 125 126 127 128 0 0 0 0 133 134 135 136 137 138 139 140 141 142 143 0 0 0 0 148

It's odd to me that the start offset is 0 regardless of timestep or writer rank. It's also odd that three of the read lengths are off by 8 bytes (something I also see in bp5dbg output). Based on the amounts of data I debug printed on each rank, I would think 200 above should be 208, 120 should be 128, and 40 should be 48.

@scottwittenburg
Collaborator Author

Hmmmm, I wonder: if rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get overwritten.

@scottwittenburg
Collaborator Author

If rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get overwritten.

That was the problem; the hint from running od was the clue that eventually led to the solution. In 13d7f06, I have patched it so that the problem case now passes the test and produces the expected data!
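
For anyone who hits the same symptom, a minimal illustration of the failure mode using POSIX open flags (an assumption for illustration; the ADIOS2 transport layer has its own open-mode handling): a truncating open wipes what another rank already wrote to the shared subfile, while an update-style open preserves it.

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int OpenSubfileForSharedWriting(const char *path, bool alreadyWrittenThisRun)
    {
        if (alreadyWrittenThisRun)
        {
            // Keep existing contents; the caller positions writes (e.g. via pwrite)
            // at the offset handed down the aggregation chain.
            return open(path, O_WRONLY | O_CREAT, 0666);
        }
        // The first writer of the run may truncate safely.
        return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    }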

@eisenhauer
Member

Great! Yeah, you may be running into filesystem oddities with this approach. Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

@scottwittenburg
Collaborator Author

But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first?

Also that clue is huge 😆

Your help was invaluable in finding this bug I wrote, thank you @eisenhauer!

I still have 2 things to handle which you pointed out in previous comments:

  1. two-level metadata aggregation: either metadata swizzling or else separate comm for metadata chains
  2. guarding against Bad Things when doing multiple Put() within a timestep

@scottwittenburg
Collaborator Author

Are you closing files after the write responsibility shifts? If not I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.

Currently each rank has a separate transport manager for every file it ends up writing to. I'm keeping each of those files open throughout the run, to avoid unnecessary Open() in case a rank ever switches back to a file it has already written to previously.

Does that seem like a bad idea? Maybe I need to seek to some location in that case? Or does WriteFileAt(DataVec.data(), DataVec.size(), m_StartDataPos) handle writing somewhere away from the current file pointer?

@eisenhauer
Member

2. guarding against Bad Things when doing multiple Put() within a timestep

Just to clarify this one, multiple Put() shouldn't change anything for you. You still get a single contiguous data block. It's calls to PerformDataWrite() that would potentially mess you up; those would result in multiple calls to WriteData() in a single timestep.
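
For concreteness, a sketch of the application-side sequence being described (assuming the adios2 C++ API; shown only to make the caveat concrete), where each PerformDataWrite() plus the EndStep() leads to a WriteData() call on the writer side:

    #include <adios2.h>
    #include <vector>

    void WriteWithMidStepFlushes(adios2::Engine &writer, adios2::Variable<double> &var,
                                 const std::vector<double> &block)
    {
        writer.BeginStep();
        writer.Put(var, block.data());
        writer.PerformDataWrite(); // data (not metadata) flushed mid-step -> WriteData()
        writer.Put(var, block.data());
        writer.PerformDataWrite(); // second mid-step flush -> WriteData() again
        writer.Put(var, block.data());
        writer.EndStep();          // final WriteData() plus the step's metadata
    }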

@scottwittenburg scottwittenburg force-pushed the data-size-based-aggregation branch from 62a70c4 to 9a1ca59 on July 3, 2025 18:54