Fix synchronization in copy_from_torch() tests #391

Open
wants to merge 3 commits into base: main

Conversation

@aidanfnv (Contributor) commented on Jul 30, 2025:

Related to #113

The tests for copy_from_torch() that I added in #363 are causing intermittent failures in the Windows CI tests, with a high failure rate.
The failures appear to come specifically from the first partial copy in test_partial_torch_copy() and the copy in test_full_torch_copy(), both in test_buffer_views.py, where the results are zeros instead of the copied data.

I suspect this is the result of a race condition. In #363 I was seeing several of the partial copies come back as zeroes until I added calls to sync_to_cuda() and sync_to_device() between the copy_from_torch() calls, and I suspect that something (perhaps torch.randn(...).cuda()?) has not finished before the first copy but has finished before the second copy thanks to those sync calls from #363. Adding the same two sync calls before the first copy appears to cause a deadlock.

This PR replaces those sync calls with a different synchronization pattern, more like the one used in test_torchbuffers.py.
I have run the Windows CI checks 8 times on this PR without any failures.
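
For reference, here is a minimal sketch of the kind of pattern this refers to, assuming (this is an assumption, not a copy of test_torchbuffers.py) that it amounts to a host-side torch.cuda.synchronize() before handing the tensor to slangpy; the helper name, tensor size and buffer argument are hypothetical:

import torch

def copy_with_host_sync(buffer):
    # torch.randn(...).cuda() enqueues work on the CUDA stream; when the call
    # returns to Python the tensor may not be fully written yet.
    src = torch.randn(64).cuda()

    # Block the host until all outstanding CUDA work has completed, so the
    # copy below reads fully written data rather than zeros.
    torch.cuda.synchronize()

    # Copy the now-complete tensor contents into the slangpy buffer.
    buffer.copy_from_torch(src)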

@aidanfnv (Contributor, Author) commented:

It looks like either this reintroduces the CI hang I ran into in #363, or that issue was and still is intermittent.

@aidanfnv changed the title from "Add synchronization before copy in copy_from_torch() tests" to "Fix synchronization in copy_from_torch() tests" on Jul 30, 2025
@aidanfnv (Contributor, Author) commented:

Adding calls to device.sync_to_cuda() and device.sync_to_device() before the first torch copy did indeed reintroduce the deadlock that I saw in my other PR, so now I am trying a different synchronization pattern, the one used in test_torchbuffers.py.
It does not deadlock, but I still need to check whether the failures are intermittent or eliminated.

@aidanfnv marked this pull request as ready for review on July 30, 2025 19:57
@aidanfnv requested a review from a team as a code owner on July 30, 2025 19:57
@aidanfnv (Contributor, Author) commented:

Currently the CI seems to fail almost every other run, but with this PR I was able to run the CI checks 8 times without any failures.

@aidanfnv requested a review from ccummingsNV on July 30, 2025 23:56
@ccummingsNV (Contributor) left a comment:

What concerns me is that if you have to do this in a test, are we suggesting users also have to do it, or that they will experience deadlocks / race conditions? Your sync calls are basically blocking the host until all GPU work is done - that doesn't seem to me to be a pattern we can have.

I don't think your double-sync approach in the first case is correct. You're effectively telling CUDA that it should wait until the device has finished, and the device to wait until CUDA is finished. I can't picture the result in my mind, but it doesn't feel right :)
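
For illustration, a rough reconstruction of the pattern being questioned, based only on the description in this thread (the helper name, tensor size and arguments are hypothetical):

import torch

def double_sync_before_copy(device, buffer):
    src = torch.randn(64).cuda()  # torch work, possibly still in flight
    device.sync_to_cuda()         # next device submit waits for the CUDA work
    device.sync_to_device()       # CUDA stream waits for the last device submit
    buffer.copy_from_torch(src)   # the first copy; with both syncs above, this deadlocked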

In theory it should be:

  • torch work
  • sync to cuda
  • device work
  • sync to device

etc

However, looking at it, the sync functions are based around submits, and Device::read_buffer_data doesn't utilize a submit. This would mean the sync_to_cuda would be ignored. That could be the cause of all our race conditions - I'll speak to Simon.

In the meantime, I don't think we can add this. Maybe disable this test until we've thought through the details of this race condition - I think it's wide-ranging and you've just made a test case that's especially nasty.
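
For comparison, a minimal sketch of the in-theory ordering listed above, using the device.sync_to_cuda() / device.sync_to_device() calls discussed in this thread (the helper name, tensor size and arguments are hypothetical, and whether copy_from_torch() actually goes through a submit is exactly the open question here):

import torch

def ordered_torch_then_device_work(device, buffer):
    # 1. torch work: enqueue GPU work on the CUDA stream.
    src = torch.randn(64).cuda()

    # 2. sync to cuda: subsequent device submits wait for the CUDA work above.
    device.sync_to_cuda()

    # 3. device work: copy the tensor contents into the slangpy buffer.
    buffer.copy_from_torch(src)

    # 4. sync to device: the CUDA stream waits for the device work above,
    #    so later torch/host reads see the finished copy.
    device.sync_to_device()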

@ccummingsNV (Contributor) commented:

Yeah:

void Device::sync_to_cuda(void* cuda_stream)
{
    // Signal fence from CUDA, wait for it on graphics queue.
    if (m_supports_cuda_interop) {
        SGL_CU_SCOPE(this);
        uint64_t signal_value = m_global_fence->update_signaled_value();
        m_cuda_semaphore->signal(signal_value, CUstream(cuda_stream));
        m_wait_global_fence = true; // <<<<<<< Just sets bool telling next submit to wait for the semaphore
    }
}

And

void Device::read_buffer_data(const Buffer* buffer, void* data, size_t size, size_t offset)
{
    SGL_CHECK_NOT_NULL(buffer);
    SGL_CHECK(offset + size <= buffer->size(), "Buffer read is out of bounds");
    SGL_CHECK_NOT_NULL(data);

    // <<<<<< Goes straight to RHI, bypassing the submit process
    SLANG_RHI_CALL(m_rhi_device->readBuffer(buffer->rhi_buffer(), offset, size, data));
}

This would impact any API call that doesn't utilize a slangpy submit, of which there are many. The fix would be to mimic what sync_to_device does and immediately insert a fence wait into the gfx queue. I'll take a look at this today and either do it, or feed back here how to do it.

@ccummingsNV (Contributor) commented:

In fact, the same issue exists in reverse - sync_to_cuda only guarantees that the most recent submit is finished, so any operation that doesn't utilize submit will be bypassed.

@ccummingsNV (Contributor) left a comment:

Race condition fixes should be in. I suggest getting this back to just the syncs that are needed and seeing if it works properly!
