- PyTorch: 1.5
- CUDA: 10.1
For multi-GPU processing on a single machine only, you just need to specify the --num-gpu argument.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2
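Under the hood, a launch like this typically spawns one process per GPU and joins them into a single process group. The sketch below illustrates that general pattern with torch.multiprocessing and torch.distributed; the worker body, address, and port are illustrative assumptions, not the actual src.tools.check_dist code.

```python
# Minimal single-machine sketch: one process per GPU, all joined into a process group.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, num_gpu):
    # On a single machine the global rank equals the local rank.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # assumed local rendezvous address/port
        world_size=num_gpu,
        rank=local_rank,
    )
    torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    num_gpu = 2  # corresponds to --num-gpu 2
    mp.spawn(worker, args=(num_gpu,), nprocs=num_gpu)
```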
For multi-machine collective communication in PyTorch, a process must first be launched on the main machine. The scripts automatically set the main machine's IP address and an unused port number for TCP communication. On the main machine, you must set --machine-rank to zero and --num-machine to the total number of machines.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0
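With multiple machines, the world size and each process's global rank are usually derived from exactly these arguments. A sketch of that arithmetic (hypothetical helper names, not the repo's exact code):

```python
def world_size(num_machine: int, num_gpu: int) -> int:
    # Every machine contributes num_gpu processes.
    return num_machine * num_gpu

def global_rank(machine_rank: int, num_gpu: int, local_rank: int) -> int:
    # Ranks are numbered machine by machine: the main machine (machine_rank 0)
    # owns ranks 0 .. num_gpu - 1, the next machine owns the following block, etc.
    return machine_rank * num_gpu + local_rank

# With --num-machine 2 and --num-gpu 2: world size 4, and the main
# machine's two processes get ranks 0 and 1.
assert world_size(2, 2) == 4
assert [global_rank(0, 2, r) for r in range(2)] == [0, 1]
```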
If you want to use a fixed port number, just specify the --dist-port argument.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0 --dist-port xxxxx
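When --dist-port is not given, an unused port is picked automatically on the main machine. A common way to do that is to bind a socket to port 0 and let the OS choose; this is a sketch of that approach, not necessarily how this repo selects its default:

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any currently unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(find_free_port())  # prints some free port number
```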
On the other machines, specify --machine-rank accordingly and set the --dist-ip and --dist-port arguments to the same values used on the main machine.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 1 --dist-ip xxx.xxx.xxx.xxx --dist-port xxxxx
Below are examples of collective communication functions and training on a single machine. They can also be executed in multi-machine settings.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.gather --num-gpu 2
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.reduce --num-gpu 2
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.loader --num-gpu 2 --seed 0
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.train --num-gpu 2 --seed 0
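As a rough picture of what the gather and reduce examples exercise, the sketch below runs all_gather and all_reduce across two processes. It uses the gloo backend and a local address so it runs without GPUs; it is an illustration of the collectives, not the actual src.tools.gather / src.tools.reduce code.

```python
# Sketch of the two basic collectives the examples demonstrate.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",  # gloo so this sketch runs on CPU
        init_method="tcp://127.0.0.1:29501",  # assumed local rendezvous address
        world_size=world_size,
        rank=rank,
    )
    x = torch.tensor([float(rank)])

    # all_gather: every rank receives the tensors from all ranks.
    gathered = [torch.zeros_like(x) for _ in range(world_size)]
    dist.all_gather(gathered, x)

    # all_reduce: every rank receives the element-wise sum over all ranks.
    summed = x.clone()
    dist.all_reduce(summed, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: gathered={[t.item() for t in gathered]}, sum={summed.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```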