Fix synchronization in copy_from_torch() tests #391
base: main
Conversation
It looks like either this reintroduces the CI hang I ran into in #363, or that issue was and still is intermittent.
Adding calls to
Currently the CI seems to fail almost every other run, but with this PR I was able to run the CI checks 8 times without any failures.
What concerns me is that if you have to do this in a test, are we suggesting users also have to do it, or that they will otherwise experience deadlocks / race conditions? Your sync calls are basically blocking the host until all GPU work is done - that doesn't seem to me to be a pattern we can have.
I don't think your double-sync approach in the first case is correct. You're effectively telling CUDA it should wait until the device has finished, and the device to wait until CUDA is finished. I can't picture the result in my mind, but it doesn't feel right :)
In theory it should be:
- torch work
- sync to cuda
- device work
- sync to device
etc
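For concreteness, here is a minimal sketch of that ordering as I read it, using the `sync_to_cuda()`, `sync_to_device()` and `copy_from_torch()` calls discussed in this thread; the device/buffer setup is elided, and everything else (function name, sizes) is a placeholder rather than actual test code:

```python
import torch

def torch_to_device_roundtrip(device, buffer, n):
    # torch work: produce data on the CUDA stream
    data = torch.randn(n, device="cuda")

    # "sync to cuda": the device queue waits for the outstanding CUDA work
    device.sync_to_cuda()

    # device work: copy the torch tensor into the device buffer
    buffer.copy_from_torch(data)

    # "sync to device": subsequent CUDA/torch work waits for the device queue
    device.sync_to_device()

    # it should now be safe to consume the results from the torch side again
    return data
```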
However, looking at it, the sync functions are based around submits, and `Device::read_buffer_data` doesn't utilize a submit. This would mean the `sync_to_cuda` would be ignored. That could be the cause of all our race conditions - I'll speak to Simon.
In the meantime, I don't think we can add this. Maybe disable this test until we've thought through the details of this race condition - I think it's wide-ranging and you've just made a test case that's especially nasty.
Yeah:
And
This would impact any API call that doesn't utilize a slangpy submit, of which there are many. The fix would be to mimic what `sync_to_device` does and immediately insert a fence wait into the gfx queue. I will take a look at this today and either do it, or feed back here on how to do it.
In fact, the same issue exists in reverse - `sync_to_cuda` only guarantees that the most recent submit is finished, so any operation that doesn't utilize submit will be bypassed.
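To make the fence-wait idea concrete, here is a rough sketch in Python-style pseudocode - every helper name below (`shared_fence`, `cuda_stream.signal`, `graphics_queue.wait`, the `read_buffer_data` signature) is hypothetical and not the actual slangpy/sgl API; the real change would live inside the native Device implementation:

```python
def insert_gfx_fence_wait(device):
    # Instead of only recording that the *next* submit should wait, signal a
    # shared fence from the CUDA stream and make the graphics queue wait on it
    # right away, so paths that never go through a submit are also ordered.
    fence = device.shared_fence                # hypothetical fence shared between CUDA and gfx
    value = device.cuda_stream.signal(fence)   # CUDA signals once its queued work completes
    device.graphics_queue.wait(fence, value)   # gfx queue waits immediately, not at the next submit

def read_buffer_data_with_sync(device, buffer, offset, size):
    # A non-submit path such as Device::read_buffer_data would perform the
    # immediate wait above before touching the buffer.
    insert_gfx_fence_wait(device)
    return device.read_buffer_data(buffer, offset, size)   # signature assumed for illustration
```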
Race condition fixes should be in. I suggest getting this back to just the syncs that are needed and seeing if it works properly!
Related to #113
The tests for `copy_from_torch()` that I added in #363 are causing intermittent failures in the Windows CI tests with a high failure rate. The failure appears to result specifically from the first partial copy in `test_partial_torch_copy()` and the copy in `test_full_torch_copy()`, both in `test_buffer_views.py`, where the results are zeros instead of the copied data.

I suspect this to be the result of a race condition. In #363 I was seeing multiple of the partial copies with zeroes until I added calls to `sync_to_cuda()` and `sync_to_device()` in between the `copy_from_torch()` calls, and I suspect that something (perhaps `torch.randn(...).cuda()`?) is not finished before the first copy but is finished before the second copy due to those sync calls from #363. Adding the same two sync calls before the first copy appears to cause a deadlock.

This PR replaces those sync calls with a different synchronization pattern, more like the one used in `test_torchbuffers.py`.

I have run the Windows CI checks 8 times on this PR without any failures.
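For reference, here is a hedged reconstruction of the failure mode described above - this is not the actual code from `test_buffer_views.py`, and the partial-copy arguments are simplified away:

```python
import torch

def suspected_race(device, buffer, n):
    data = torch.randn(n).cuda()      # CUDA work that may still be in flight

    buffer.copy_from_torch(data)      # first copy: intermittently observes zeros

    device.sync_to_cuda()             # the syncs added in #363 between the copies
    device.sync_to_device()

    buffer.copy_from_torch(data)      # second copy: sees the finished data
```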