The failure on #439 is due to CUDA synchronization issues. From what I understand, the changes in JuliaGPU/CUDA.jl#395 mean that streams are no longer globally synchronized. Since Open MPI operates its own streams, operations launched by CUDA.jl and by Open MPI can potentially overlap (open-mpi/ompi#7733).
It seems like we need to call CUDA.synchronize() (or perhaps CUDA.synchronize(CuStreamPerThread())?) before calling MPI?
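For illustration, here is a minimal sketch of what I have in mind, assuming a CUDA-aware MPI build and using only the plain CUDA.synchronize() form. The buffer name and the choice of Allreduce! are just for the example and are not taken from the failing test; whether the per-thread stream variant would be sufficient is still an open question.

```julia
using CUDA
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Fill a device buffer; this work is queued asynchronously by CUDA.jl.
buf = CUDA.fill(Float32(MPI.Comm_rank(comm)), 4)

# Wait for the outstanding GPU work to finish before MPI touches the buffer.
# Assumption: Open MPI uses its own streams, so without this the reduction
# could overlap with the fill above.
CUDA.synchronize()

# In-place reduction on the device buffer (requires CUDA-aware MPI).
MPI.Allreduce!(buf, +, comm)

MPI.Finalize()
```

If this is the right fix, the synchronize call could presumably live inside the MPI.jl wrappers for CuArray buffers rather than in user code.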