Fix synchronization in copy_from_torch() tests #391

Open
wants to merge 3 commits into base: main

Conversation

@aidanfnv (Contributor) commented on Jul 30, 2025:

Related to #113

The tests for copy_from_torch() that I added in #363 are causing intermittent failures in the Windows CI tests, with a high failure rate.
The failures appear to come specifically from the first partial copy in test_partial_torch_copy() and the copy in test_full_torch_copy(), both in test_buffer_views.py, where the results are zeros instead of the copied data.

I suspect this is the result of a race condition. In #363 I was seeing several of the partial copies come back as zeroes until I added calls to sync_to_cuda() and sync_to_device() between the copy_from_torch() calls, and I suspect that something (perhaps torch.randn(...).cuda()?) has not finished before the first copy but has finished before the second copy thanks to those sync calls from #363. Adding the same two sync calls before the first copy appears to cause a deadlock.

This PR replaces those sync calls with a different synchronization pattern, more like the one used in test_torchbuffers.py.
I have run the Windows CI checks 8 times on this PR without any failures.
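
For reference, here is a minimal sketch of the kind of pattern this refers to, assuming (this is an assumption, not a copy of test_torchbuffers.py) that it amounts to a host-side torch.cuda.synchronize() before handing the tensor to slangpy; the helper name, tensor size and buffer argument are hypothetical:

import torch

def copy_with_host_sync(buffer):
    # torch.randn(...).cuda() enqueues work on the CUDA stream; when the call
    # returns to Python the tensor may not be fully written yet.
    src = torch.randn(64).cuda()

    # Block the host until all outstanding CUDA work has completed, so the
    # copy below reads fully written data rather than zeros.
    torch.cuda.synchronize()

    # Copy the now-complete tensor contents into the slangpy buffer.
    buffer.copy_from_torch(src)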

@aidanfnv (Contributor, Author) commented:

It looks like either this reintroduces the CI hang I ran into in #363, or that issue was and still is intermittent.

@aidanfnv changed the title from "Add synchronization before copy in copy_from_torch() tests" to "Fix synchronization in copy_from_torch() tests" on Jul 30, 2025
@aidanfnv (Contributor, Author) commented:

Adding calls to device.sync_to_cuda() and device.sync_to_device() before the first torch copy did indeed reintroduce the deadlock that I saw in my other PR, so now I am trying a different synchronization pattern, the one used in test_torchbuffers.py.
It does not deadlock, but I still need to check whether the failures are intermittent or eliminated.

@aidanfnv marked this pull request as ready for review on July 30, 2025 19:57
@aidanfnv requested a review from a team as a code owner on July 30, 2025 19:57
@aidanfnv (Contributor, Author) commented:

Currently the CI seems to fail almost every other run, but with this PR I was able to run the CI checks 8 times without any failures.

@aidanfnv requested a review from ccummingsNV on July 30, 2025 23:56
@ccummingsNV (Contributor) left a comment:

What concerns me is that if you have to do this in a test, are we suggesting users also have to do it, or that they will experience deadlocks / race conditions? Your sync calls are basically blocking the host until all GPU work is done - that doesn't seem to me to be a pattern we can have.

I don't think your double-sync approach in the first case is correct. You're effectively telling CUDA that it should wait until the device has finished, and the device to wait until CUDA is finished. I can't picture the result in my mind, but it doesn't feel right :)
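
For illustration, a rough reconstruction of the pattern being questioned, based only on the description in this thread (the helper name, tensor size and arguments are hypothetical):

import torch

def double_sync_before_copy(device, buffer):
    src = torch.randn(64).cuda()  # torch work, possibly still in flight
    device.sync_to_cuda()         # next device submit waits for the CUDA work
    device.sync_to_device()       # CUDA stream waits for the last device submit
    buffer.copy_from_torch(src)   # the first copy; with both syncs above, this deadlocked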

In theory it should be:

  • torch work
  • sync to cuda
  • device work
  • sync to device

etc

However, looking at it, the sync functions are based around submits, and Device::read_buffer_data doesn't utilize a submit. This would mean the sync_to_cuda would be ignored. That could be the cause of all our race conditions - I'll speak to Simon.

In the meantime, I don't think we can add this. Maybe disable this test until we've thought through the details of this race condition - I think it's wide-ranging and you've just made a test case that's especially nasty.
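
For comparison, a minimal sketch of the in-theory ordering listed above, using the device.sync_to_cuda() / device.sync_to_device() calls discussed in this thread (the helper name, tensor size and arguments are hypothetical, and whether copy_from_torch() actually goes through a submit is exactly the open question here):

import torch

def ordered_torch_then_device_work(device, buffer):
    # 1. torch work: enqueue GPU work on the CUDA stream.
    src = torch.randn(64).cuda()

    # 2. sync to cuda: subsequent device submits wait for the CUDA work above.
    device.sync_to_cuda()

    # 3. device work: copy the tensor contents into the slangpy buffer.
    buffer.copy_from_torch(src)

    # 4. sync to device: the CUDA stream waits for the device work above,
    #    so later torch/host reads see the finished copy.
    device.sync_to_device()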

@ccummingsNV (Contributor) commented:

Yeah:

void Device::sync_to_cuda(void* cuda_stream)
{
    // Signal fence from CUDA, wait for it on graphics queue.
    if (m_supports_cuda_interop) {
        SGL_CU_SCOPE(this);
        uint64_t signal_value = m_global_fence->update_signaled_value();
        m_cuda_semaphore->signal(signal_value, CUstream(cuda_stream));
        m_wait_global_fence = true; // <<<<<<< Just sets bool telling next submit to wait for the semaphore
    }
}

And

void Device::read_buffer_data(const Buffer* buffer, void* data, size_t size, size_t offset)
{
    SGL_CHECK_NOT_NULL(buffer);
    SGL_CHECK(offset + size <= buffer->size(), "Buffer read is out of bounds");
    SGL_CHECK_NOT_NULL(data);

    // <<<<<< Goes straight to RHI, bypassing the submit process
    SLANG_RHI_CALL(m_rhi_device->readBuffer(buffer->rhi_buffer(), offset, size, data));
}

This would impact any API call that doesn't utilize a slangpy submit, of which there are many. The fix would be to mimic what sync_to_device does and immediately insert a fence wait into the gfx queue. I'll take a look at this today and either do it, or feed back here how to do it.

@ccummingsNV (Contributor) commented:

In fact, the same issue exists in reverse - sync_to_cuda only guarantees that the most recent submit is finished, so any operation that doesn't utilize submit will be bypassed.

@ccummingsNV (Contributor) left a comment:

Race condition fixes should be in. I suggest getting this back to just the syncs that are needed and seeing if it works properly!
