Data size based aggregation #4548
base: master
Conversation
fyi @pnorbert and @eisenhauer This is my strawman for data-size based aggregation. Over the next couple of days I plan to comment in various locations in the changeset to ask questions about the direction this approach is taking, with the goal of getting your feedback.
Force-pushed from 758fb60 to 1956453
        useProfiler, *DataWritingComm);
}

// If we got a cache hit above, we'd like to skip the enclosed OpenFiles
This is one of the places where I tried to avoid opening files in the case of a cache hit, but it results in deadlock.
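As a minimal illustration of why that happens (generic MPI, assuming the file-open path performs a collective on the data-writing communicator; `MaybeOpenFiles` is a hypothetical stand-in, not ADIOS2 code):

```cpp
#include <mpi.h>

// If the file-open path contains a collective operation, letting only the
// cache-hit ranks skip it strands the cache-miss ranks inside the
// collective: they wait for peers that never arrive.
void MaybeOpenFiles(MPI_Comm comm, bool cacheHit)
{
    if (cacheHit)
    {
        return; // this rank skips the collective below...
    }
    // ...so ranks that reach this point block until every rank in comm
    // arrives, which never happens: a collective-mismatch deadlock.
    MPI_Barrier(comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Simulate a cache hit on even ranks only: with 2+ ranks this hangs.
    MaybeOpenFiles(MPI_COMM_WORLD, rank % 2 == 0);
    MPI_Finalize();
    return 0;
}
```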
@@ -121,6 +125,12 @@ bp_gtest_add_tests_helper(WriteReadAttributesMultirank MPI_ALLOW)
bp_gtest_add_tests_helper(LargeMetadata MPI_ALLOW)
bp5_gtest_add_tests_helper(WriteStatsOnly MPI_ALLOW)

if (ADIOS2_HAVE_MPI)
  gtest_add_tests_helper(DataSizeAggregate MPI_ONLY BP Engine.BP. .BP5.DSB
                         WORKING_DIRECTORY ${BP5_DIR} EXTRA_ARGS "DataSizeBased" 2 1
The extra args passed here are the aggregation type, the desired number of subfiles, and the number of time steps to iterate. Run with any number of subfiles and a single time step, the test passes. Running with 2 timesteps results in failure when trying to read the data back in.
2 steps failure details
$ /opt/mpi/mpich/4.1.2/install/bin/mpiexec "-n" "4" "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/build/bin/Test.Engine.BP.DataSizeAggregate.MPI" "--gtest_filter=DSATest.TestWriteUnbalancedData" "DataSizeBased" "2" "2"
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN ] DSATest.TestWriteUnbalancedData
Note: Google Test filter = DSATest.TestWriteUnbalancedData
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from DSATest
[ RUN ] DSATest.TestWriteUnbalancedData
Number of timesteps: 2
Aggregation type: DataSizeBased
Number of subfiles: 2
Rank data sizes (in number of columns)
1 2 3 4
--- Constructing BP5Writer (0)---
--- Constructing BP5Writer (2)---
--- Constructing BP5Writer (3)---
BP5Writer::0::BeginStep(StepMode::Append, -1)
BP5Writer::2::BeginStep(StepMode::Append, -1)
BP5Writer::3::BeginStep(StepMode::Append, -1)
--- Constructing BP5Writer (1)---
BP5Writer::0::EndStep()
BP5Writer::2::EndStep()
BP5Writer::0::WriteData()
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
BP5Writer::2::WriteData()
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
BP5Writer::1::BeginStep(StepMode::Append, -1)
BP5Writer::3::EndStep()
BP5Writer::1::EndStep()
BP5Writer::3::WriteData()
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
BP5Writer::1::WriteData()
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
Rank data sizes: [32, 64, 96, 128Paritioning resulted in 2 substreams:
0: [3 0 ], partition data size: 160
1: [2 1 ], partition data size: 160
Rank 0 - A
Rank 0 - B
Rank Rank 1 - A
Rank 1 - B
Rank 1 - D
Rank 1 - F
Rank 2 - A
Rank 2 - B
Rank 2 - D
Rank 2 - F
Rank 3 - A
Rank 3 - B
Rank 3 - D
Rank 3 - F
0 - D
Rank 0 - E
Rank 0 - F
Rank 0 - G
Rank 0 - I
Rank 1 - G
Rank 1 - I
Rank 1 - J
Rank Rank 2 - G
Rank 2 - I
Rank 2 - J
Rank 2 - K
Rank 2 - L
Rank 3 - G
Rank 3 - I
Rank 3 - J
Rank 3 - K
Rank 3 - L
Rank 0 - J
Rank 0 - K
Rank 0 - L
1 - K
Rank 1 - L
Rank 0 - M
Rank 0 - N
Rank 0 - O
Rank 1 - M
Rank 1 - N
Rank 1 InitBPBuffer()
Rank 1 new writer needed, generate now, write later
Rank 2 - M
Rank 2 - N
Rank 2 InitBPBuffer()
Rank 3 - M
Rank 3 - N
Rank 3 InitBPBuffer()
Rank 3 new writer needed, generate now, write later
Rank 2 new writer needed, generate now, write later
Rank 0 InitBPBuffer()
Rank 0 new writer needed, generate now, write later
BP5Writer::1::BeginStep(StepMode::Append, -1)
BP5Writer::2::BeginStep(StepMode::Append, -1)
BP5Writer::3::BeginStep(StepMode::Append, -1)
BP5Writer::0::WriteMetaMetadata()
BP5Writer::0::WriteMetadata() OVERLOAD2
BP5Writer::1::EndStep()
BP5Writer::2::EndStep()
BP5Writer::0::WriteMetadataFileIndex()
BP5Writer::1::WriteData()
BP5Writer::2::WriteData()
BP5Writer::3::EndStep()
BP5Writer::3::WriteData()
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
BP5Writer::0::BeginStep(StepMode::Append, -1)
BP5Writer::0::EndStep()
BP5Writer::0::WriteData()
InitAggregator() - DataSizeBased: Closing and re-opening MPIChain
Rank data sizes: [128, 64, 96, 32Paritioning resulted in 2 substreams:
0: [0 3 ], partition data size: 160
1: [2 1 ], partition data size: 160
Rank 1 cache hit for aggregator key 1-2-1-0
Rank 2 cache hit for aggregator key 1-2-0-1
Rank 2 - A
Rank 2 - B
Rank 2 - D
Rank 2 - F
Rank 0 - A
Rank 0 - B
Rank 0 - D
Rank 0 - E
Rank 1 - A
Rank 1 - B
Rank 1 - D
Rank 1 - F
Rank 3 - A
Rank 3 - B
Rank 3 - D
Rank 3 - F
Rank 0 - F
Rank 0 - G
Rank 0 - I
Rank 2 - G
Rank 2 - I
Rank 2 - J
Rank 2 - K
Rank 2 - L
Rank 3 - G
Rank 3 - I
Rank 3 - J
Rank 3 - K
Rank 3 - L
Rank 0 - J
Rank 0 - K
Rank 0 - L
Rank 1 - G
Rank 1 - I
Rank 1 - J
Rank 1 - K
Rank 1 - L
Rank 0 - M
Rank 0 - N
Rank 0 - O
Rank 3 - M
Rank 3 - N
Rank 3 InitBPBuffer()
Rank 3 new writer needed, generate now, write later
Rank 1 - M
Rank 1 - N
Rank 1 InitBPBuffer()
Rank 1 new writer needed, generate now, write later
Rank 2 - M
Rank 2 - N
Rank 2 InitBPBuffer()
Rank 2 new writer needed, generate now, write later
Rank 0 InitBPBuffer()
Rank 0 new writer needed, generate now, write later
BP5Writer::0::WriteMetaMetadata()
BP5Writer::1::DoClose()
BP5Writer::2::DoClose()
BP5Writer::0::WriteMetadata() OVERLOAD2
BP5Writer::3::DoClose()
BP5Writer::0::WriteMetadataFileIndex()
BP5Writer::0::DoClose()
Finished writing unbalanced data, reading it back
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.
[ FAILED ] DSATest.TestWriteUnbalancedData (11 ms)
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.
[ FAILED ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[ PASSED ] 0 tests.
[ FAILED ] unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.
[ FAILED ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DSATest.TestWriteUnbalancedData
1 FAILED TEST
unknown file: Failure
C++ exception with description "[Tue Jun 17 17:29:40 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
" thrown in the test body.
[ FAILED ] DSATest.TestWriteUnbalancedData (11 ms)
[----------] 1 test from DSATest (11 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DSATest.TestWriteUnbalancedData
1 FAILED TEST
[----------] 1 test from DSATest (11 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DSATest.TestWriteUnbalancedData
1 FAILED TEST
1 test, listed below:
[ FAILED ] DSATest.TestWriteUnbalancedData
1 FAILED TEST
You might notice in that case that I ran with 4 processes, and that the greedy partitioning strategy, as well as my naive choice of cache key, played a role in two ranks having a cache hit on the 2nd time step while the other two had cache misses. So while all ranks write to the same file for t=2 as they did for t=1, because the partitioning strategy re-ordered the ranks in one of the substreams, and because the cache key includes the substream rank order, those two ranks think they need to open new files.

Also note that the only reason there is no deadlock when running for 2 timesteps is that currently the InitTransports() method does all the normal init steps, regardless of whether there was a cache hit or not, specifically to avoid deadlocks.
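To make the ordering issue concrete, here is a hedged sketch (names are mine, not the PR's) of how a greedy largest-first partition can emit the same set of ranks in a different order between timesteps, and how a key built from the sorted member ranks would survive that reordering:

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// With greedy largest-first partitioning (ranks sorted by descending data
// size, each assigned to the currently lightest substream), sizes
// [32,64,96,128] give substream 0 the member order {3,0}, while
// [128,64,96,32] give it {0,3}: the same set, reordered, as in the log.
std::string OrderSensitiveKey(const std::vector<std::size_t> &chainRanks)
{
    std::ostringstream key;
    for (std::size_t r : chainRanks)
        key << r << "-"; // "3-0-" vs "0-3-" -> spurious cache miss
    return key.str();
}

std::string OrderInsensitiveKey(std::vector<std::size_t> chainRanks)
{
    std::sort(chainRanks.begin(), chainRanks.end());
    std::ostringstream key;
    for (std::size_t r : chainRanks)
        key << r << "-"; // both timesteps produce "0-3-" -> cache hit
    return key.str();
}
```

Whether an order-insensitive key is actually safe depends on whether the chain order affects which rank aggregates and writes; if it does, the cached entry would also need to record that.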
Looking at the bp produced when running with 2 timesteps, bpls produces output that agrees with the testing code:
bpls output
$ ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bpls -d -s "0,0" -c "-1,-1" ./unbalanced_output.bp
Failed to open with BPFile engine: [Tue Jun 17 17:31:35 2025] [ADIOS2 EXCEPTION] <Engine> <BP5Reader> <ParseMetadataIndex> : ADIOS2 BP5 Engine only supports bp format version 5, found 0 version
Error: Could not open this file with any ADIOS2 file reading engines
And bp5dbg has an even harder time with the 2-timestep output:
bp5dbg output
$ PYTHONPATH=~/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages ~/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg ./unbalanced_output.bp
========================================================
Index Table File: ./unbalanced_output.bp/md.idx
========================================================
------------------------------------------------------------------------------------------------------------------
| Version string |Major|Minor|Patch|unused|Endian|BP version|BP minor|Active|ColumnMajor|unused|
| 32 bytes | 1B | 1B | 1B | 1B | 1B | 1B | 1B | 1B | 1B | 23B |
+----------------------------------+-----+-----+-----+------+------+----------+--------+------+-----------+------+
| | | | | | yes | 0 | 0 | no | yes | |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [242514657280]
Unknown record , lenght = [242514657280]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [67108864]
Unknown record , lenght = [67108864]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [512]
Unknown record , lenght = [512]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [72057594037927936]
Unknown record , lenght = [72057594037927936]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [15889042532990976]
Unknown record , lenght = [15889042532990976]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [1099511627776000]
Unknown record , lenght = [1099511627776000]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [4294967296000]
Unknown record , lenght = [4294967296000]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [0]
Unknown record , lenght = [0]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [536870912]
Unknown record , lenght = [536870912]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [3145728]
Unknown record , lenght = [3145728]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Record '', length = [8192]
Unknown record , lenght = [8192]
--------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg", line 155, in <module>
status = DumpIndexTableFile(args)
File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/bin/bp5dbg", line 93, in DumpIndexTableFile
status, MetadataIndexTable, WriterMap = DumpIndexTable(
File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages/adios2/bp5dbg/idxtable.py", line 156, in DumpIndexTable
status, MetadataIndexTable, WriterMap = ReadIndex(
File "/home/local/KHQ/scott.wittenburg/projects/EfficientInSituIO/fides_adios/adios2/install/local/lib/python3.10/dist-packages/adios2/bp5dbg/idxtable.py", line 64, in ReadIndex
reclen = np.frombuffer(table, dtype=np.uint64, count=1,
ValueError: buffer is smaller than requested size
And finally, note that I think the test logic itself is more or less sound for 2 timesteps, because if you provide EveryoneWritesSerial as the first argument, the test passes, and bpls / bp5dbg produce reasonable-looking (to me) output.
It appears to me from the output of the test / bpls / bp5dbg that the error is coming from here, and thus is possibly caused by some problem in the metadata index file?
> And to make sure I understand, you would only expect this problem to rear its head when we exceed OneLevelGatherRanksLimit?

Yes, I believe that that's true...
I'm mulling those first two options. Upon first consideration, option 1 seems more attractive since it seems easier to implement (but maybe I just haven't seen far enough along that path yet).

> Yes, I believe that that's true...

And when it happens, how do you expect it would manifest itself? I'm just looking for a rough idea here, like would it segfault when writing, or just produce data that can't be read back in correctly?
Maybe a segfault when writing, but definitely bad data on read (unless you got lucky). That "per rank metadata size" array is gathered in a one-level gather, so the size values will be in a canonical order. It's then used to create the per-rank size values in the subsequent gathervs, be they one or two level. If we end up associating a size with the wrong rank we'll gather trash data, either too little from one rank or too much from another. The latter might be a segfault if it was unmapped memory, but even if it didn't blow up on the writer, we'd be calculating the wrong start positions for metadata on the reader and blow up when trying to parse. (Actually, you'd likely get an FFS error about unknown formats because it wouldn't find the right header when starting the parse...)
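A rough sketch of that failure mode (generic MPI, not the actual BP5 code): the gathered per-rank sizes become the counts and displacements for the subsequent gatherv, so a size associated with the wrong rank shifts every later block:

```cpp
#include <mpi.h>
#include <vector>

void GatherMetadata(MPI_Comm comm, const std::vector<char> &myMeta)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // One-level gather of per-rank metadata sizes, in canonical rank order.
    int myCount = static_cast<int>(myMeta.size());
    std::vector<int> counts(size);
    MPI_Gather(&myCount, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

    // Displacements are prefix sums of the counts. If counts[i] actually
    // belongs to some other rank j, every block after i lands at the wrong
    // offset: too little gathered from one rank, too much from another,
    // and the reader computes wrong start positions when parsing.
    std::vector<int> displs(size, 0);
    for (int i = 1; i < size; ++i)
        displs[i] = displs[i - 1] + counts[i - 1];

    std::vector<char> all;
    if (rank == 0)
        all.resize(displs[size - 1] + counts[size - 1]);
    MPI_Gatherv(myMeta.data(), myCount, MPI_BYTE, all.data(), counts.data(),
                displs.data(), MPI_BYTE, 0, comm);
}
```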
I'm going back to consider the two-level metadata aggregation, and want to get more clarification on the problem in general, and one of those approaches you mentioned specifically.
Generally, I'm not sure I see in the code how the order of the initial data aggregator/mpichains has anything to do with the metadata aggregator/mpichains. They are currently separate aggregators, and as far as I can see, their only connection to each other is that they both have MPI_COMM_WORLD (or whatever communicator is passed to the writer's constructor) as their parent communicator.

And regarding option 1 specifically, you said:

> First, we could separate the aggregation communicators used for data from that used for metadata. There is no logical relationship between those operations, so by sticking with the original plan for metadata communication we avoid all these issues. (And I don't think there's overhead associated with having multiple subcommunicators.)
When you say "original plan" here, do you mean how metadata is aggregated before this PR? Or is it rather something we discussed in a meeting? So far in this PR I think I tried not to touch how metadata is aggregated, so is it possible this is already a non-issue?
If those aggregators are disjoint, then it's a non-issue. I was under the impression that we were using the same aggregators for both data and metadata aggregation when metadata aggregation was multi-level. If we're not, then no problems.
Force-pushed from c9de252 to 4408611
With my current implementation, I have found a broken case when 3 of 5 ranks get an aggregator cache hit, while 2 of 5 do not. I'll try to describe the problem by first showing the expected outputs when the aggregation type is EveryoneWritesSerial. In both cases I run with 5 mpi processes, 3 subfiles, and 2 timesteps.

First look at the data produced when using EveryoneWritesSerial:

EveryoneWritesSerial - bpls
EveryoneWritesSerial - bp5dbg

Now look at the data produced when using DataSizeBased:

DataSizeBased - bpls
DataSizeBased - bp5dbg

Note when reading the output above that the size based partitioning produced a different rank-to-subfile assignment at each timestep. Any idea where I should be looking to correct that problem? Seems like it's in the metadata index file, but beyond that, I'm not sure. Also, it appears to me that the other writer data locations for timestep 2 are correct, which is confusing, unless I'm interpreting the data wrong.

One last note: when repeating the above tests with only one time step, the test passes in both cases, and the data looks fine. To me that seems to indicate the problem is related to repeating some of the initialization on subsequent timesteps.
Hmm. A quick question: are you using SelectiveAggregationMetadata? I.e., is the UseSelectiveMetadataAggregation engine parameter at its default "true" value, and the first branch of this if in BP5Writer being taken? Because if not, if you're falling into TwoLevelAggregationMetadata, then all the unfortunate effects that I might have anticipated for >6K nodes are hitting you now...
Ah yes, great question. I assume you asked because you noticed the re-ordering between timesteps. I just confirmed that in this case, it's taking the SelectiveAggregationMetadata branch.
The aggregator rank in each mpi chain receives the final data position from the last rank in the chain. Is that because there is an assumption there that the aggregator rank is going to be writing to the same file the next time around?
Just wanted to make sure, because the TwoLevel stuff uses those aggregators and if they're changing, potentially bad things. Also, as a debugging aid, make sure this
Something else to note while I'm thinking about it. You'll notice that WriteData() is also being called in FlushData(). FlushData is a user-visible BP5 call that allows the application to write partial timestep data in the middle of a timestep (i.e. before EndStep()). This is useful if you are pushing memory limits and don't have room to buffer. So, for example, you could do BeginStep(), Put(), Put(), FlushData(), Put(), Put(), FlushData(), Put(), EndStep(). Each of those flushes and the end step results in data (but not metadata) going to disk. The FlushPosSizeInfo vector on rank 0 accumulates all the info from each rank for where its data landed and how much data it was on each Flush. Note that as a natural consequence of writing piece-wise, the data from rank X is not contiguous in the data file, but broken up into multiple segments. In order to find the data at Offset M, timestep T, from rank N, we have to have the offset and size of each chunk, so this all ends up in a single metadataindex record for that timestep. The relevant bit for you is that WriterMapRecords only occur between timesteps. Which means that we don't have the ability to let a rank change what file it is writing to in the middle of a timestep. So, the first time WriteData() is called in a timestep (which might be in a FlushData or in EndStep), you are free to change the WriterSubfileMap. But you should not change it again in that timestep. You're probably not playing with those examples yet, but this is something that will require a little bit of extra logic to make sure it doesn't happen.
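One way to enforce that constraint (my sketch, not something in the PR): latch the subfile map on the first WriteData() of each step, whether it comes from FlushData() or EndStep():

```cpp
// Allow the subfile map to change only on the first WriteData() of a
// timestep, since WriterMapRecords can only be written between timesteps.
class SubfileMapGuard
{
public:
    void BeginStep() { m_MapLockedThisStep = false; }

    // Called at the top of WriteData(), whether triggered by FlushData()
    // or by EndStep().
    bool MayRepartition()
    {
        if (m_MapLockedThisStep)
            return false;       // later flushes in this step reuse the map
        m_MapLockedThisStep = true; // first write of the step may change it
        return true;
    }

private:
    bool m_MapLockedThisStep = false;
};
```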
// Stuff we only want to do on the first timestep
m_SubstreamDataPos.resize(m_Parameters.NumSubFiles);
}
m_CommAggregators =
It appeared to me that m_CommAggregators was split off from the world communicator and then never used anywhere, so I decided to re-purpose it. Hopefully that's acceptable and I didn't misread the code.
By the way, on the extra debug output: for subsequent time steps, it seems to print nothing instead of those characters from the buffer. I wonder if that block didn't get updated when some other bit of the code did, due to it not getting exercised and validated?

Ah, yes, that's probably the case. Those items are size_t and maybe the type of buf changed. Which may mean those index calcs into it are off too. I can't look tonight, but will check tomorrow.
OK, try substituting this code into the matching bits of WriteMetadataFileIndex(). (I won't put these changes into a PR now because that would just make merging your code more difficult when it comes to that.)
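The substituted code isn't reproduced here, but as a minimal illustration of the hazard the earlier comments describe (assuming, as discussed above, that buf became a raw byte buffer while the records stayed size_t-sized): logical element i lives at a byte offset, not at index i:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Read logical element i of a uint64_t/size_t record array stored in a
// raw byte buffer. Indexing buf[i] directly would land 1 byte -- not 8 --
// past the previous element and yield garbage (or nothing printable).
uint64_t ReadRecord(const std::vector<char> &buf, std::size_t i)
{
    uint64_t value = 0;
    std::memcpy(&value, buf.data() + i * sizeof(uint64_t), sizeof(value));
    return value;
}
```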
Thanks Greg, that works and jibes with both the extra debug printing I was doing, as well as with what bp5dbg shows. So I'm just curious if this summary below 1) contains all the relevant information you would expect is needed to indicate correct writing of the data, and 2) if it actually looks correct to you based on the writer chains at each time step (it looks correct to me).

Debug output showing writer chains and metadata
And here is the comparison of what is produced by bpls for the two aggregation types:

Side-by-side bpls output
Let me know if you spot something that looks wrong, or if I haven't included complete details.
The metadatafile index stuff does look like what I'd expect. However, things obviously aren't right even with just a single timestep, so something is clearly going badly. I'd like to try to reproduce this output and would be hopeful that if I step through with the debugger on the reader side I might see what the issue is. Can you tell me what/how you're running on the writer side so I can give that a go?
Well, if I run the test with just a single timestep, everything looks fine and works. But when I run the test with two timesteps, somehow the first timestep gets messed up in retrospect or something.

😅 That's for sure!

That would be so great, thank you! I'm just running variations of one of the new tests I added, which you can run with all the defaults as:

The

Many combinations of parameters don't result in failure or a messed up output file like the example above does. I suspect it's probably related to how some of the ranks change the subfile they're writing to between steps. You can try with 1 timestep instead of 2, and should see that it works. Note: I need to
OK, I'll give this a go. I've got a couple of calls yet this afternoon, so it won't be right away. Just FYI, in case you get there first, my plan is to look at what is happening in BP5Reader::ReadData(). The BP5Deserializer looks at the Get()s that happen in a timestep and translates those into a ReadList of data that it needs. That list looks like: from WriterRank W, data from Timestep T, read Length bytes from Offset S. ReadData()'s responsibility is to translate that into actual data access. One of the first things it does is to translate W and T to a SubfileNum. Looking at what's going wrong there should be informative.
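As a mental model of that translation (the types below are hypothetical stand-ins, not the BP5Deserializer's actual structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// One ReadList entry: from WriterRank W, timestep T, read Length bytes
// starting at StartOffset.
struct ReadRequest
{
    std::size_t WriterRank;
    std::size_t Timestep;
    std::uint64_t StartOffset;
    std::uint64_t Length;
};

// writerMap[T][W] == the subfile that writer W wrote to during timestep T.
using WriterMap = std::map<std::size_t, std::vector<std::size_t>>;

std::size_t SubfileFor(const WriterMap &writerMap, const ReadRequest &req)
{
    // If the map recorded for timestep T is stale or wrong, the read is
    // directed at the wrong subfile (or the wrong offset within it),
    // which is exactly the step worth watching in the debugger.
    return writerMap.at(req.Timestep).at(req.WriterRank);
}
```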
Ok thanks for that information, I'll start looking in BP5Reader::ReadData().
output.txt
But we're not finding the data that should be there. You say that this works with writing just one timestep. Is it possible that writing the 2nd timestep is killing some of the data from the first? Whatever is going on, subfile 1 has zeroes up through Offset 12288, and that doesn't look right.
Huh, that is interesting. There is initialization stuff that gets done on every timestep now (it used to only be done once), and I'm just selectively guarding against doing some specific things within there. Maybe more bits need to be guarded against running 1) on a timestep after the first, or 2) in the case of an aggregator output file cache hit. Like seeking to position 0 in a file, for example.

I wrote a tiny standalone program just to read the file (mainly so I could run it with a single process), and then added some debug printing to the reader. I see a couple of weird things off the bat:

single process reader debug output

It's odd to me that the start offset is 0 regardless of timestep or writer rank. It's also odd that three of the read lengths are off by 8 bytes (something I also see in
Hmmmm, I wonder: if rank X opens a file already opened and written by rank Y, should it be doing that in append mode? Otherwise the file will get overwritten.
That was the problem, the hint from running
Great! Yeah, you may be running into filesystem oddities with this approach. Are you closing files after the write responsibility shifts? If not, I can imagine that if the responsibility shifts back you might need to close and re-open to get a sane file pointer.
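The shape of the fix, as a POSIX-style sketch (assumptions mine; ADIOS2's transport layer wraps this differently):

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Open a subfile for writing. If another rank already created and wrote
// to it, O_TRUNC would zero out that rank's data from the earlier
// timestep -- the "zeroes up through Offset 12288" symptom above.
int OpenSubfile(const char *path, bool alreadyWrittenByPeer)
{
    if (alreadyWrittenByPeer)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        lseek(fd, 0, SEEK_END); // resume writing after the existing data
        return fd;
    }
    // The first writer of this subfile may create or truncate it.
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```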
Also that clue is huge 😆 Your help was invaluable in finding this bug I wrote, thank you @eisenhauer! I still have 2 things to handle which you pointed out in previous comments:
Currently each rank has a separate transport manager for every file it ends up writing to. I'm keeping each of those files open throughout the run, to avoid unnecessary opens and closes. Does that seem like a bad idea? Maybe I need to seek to some location in that case? Or does
Just to clarify this one: multiple Put()s shouldn't change anything for you. You still get a single contiguous data block. It's calls to PerformDataWrite() that would potentially mess you up. That would result in multiple calls to WriteData() in a single timestep.
Force-pushed from 62a70c4 to 9a1ca59
Goal: Add a new aggregation approach on the writer side, where ranks are grouped into MPIChains of roughly equal data size, for writing.
- Refactors BP5Writer to collect some member fields into a new structure containing the things that should be cached. The purpose of this is to avoid opening the same files repeatedly for subsequent time steps, when it can be avoided.
- Reworks BP5Writer initialization logic to support data-size based aggregation (which is off by default and enabled with a configuration setting).
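A minimal sketch of that caching idea (field names hypothetical; the real structure would hold ADIOS2's aggregator and transport manager types):

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Everything tied to one substream layout, grouped so it can be reused
// when the same layout recurs on a later timestep, instead of re-opening
// the same subfiles.
struct AggregationCacheEntry
{
    std::shared_ptr<void> Aggregator;  // chain/communicator state (stand-in type)
    std::shared_ptr<void> FileManager; // open transports for this rank's subfile
    std::uint64_t DataPos = 0;         // where the next write to it should land
};

// Keyed by a description of the layout (e.g. subfile index and chain
// position); a hit means the files are already open and positioned.
std::map<std::string, AggregationCacheEntry> aggregationCache;
```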