- PyTorch: 1.5
- CUDA: 10.1
For multi-GPU processing on a single machine only, you just need to specify the --num-gpu argument.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2
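Under the hood, a launch like this typically spawns one process per GPU and joins them into a single process group. The sketch below illustrates that general pattern with torch.multiprocessing and torch.distributed; the worker body, address, and port are illustrative assumptions, not the actual src.tools.check_dist code.

```python
# Minimal single-machine sketch: one process per GPU, all joined into a process group.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, num_gpu):
    # On a single machine the global rank equals the local rank.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # assumed local rendezvous address/port
        world_size=num_gpu,
        rank=local_rank,
    )
    torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    num_gpu = 2  # corresponds to --num-gpu 2
    mp.spawn(worker, args=(num_gpu,), nprocs=num_gpu)
```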
For multi-machine collective communication in PyTorch, a process must first be launched on the main machine. The scripts automatically set the main machine's IP address and an unused port number for TCP communication. On the main machine, you must set --machine-rank to zero and --num-machine to the total number of machines.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0
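With multiple machines, the world size and each process's global rank are usually derived from exactly these arguments. A sketch of that arithmetic (hypothetical helper names, not the repo's exact code):

```python
def world_size(num_machine: int, num_gpu: int) -> int:
    # Every machine contributes num_gpu processes.
    return num_machine * num_gpu

def global_rank(machine_rank: int, num_gpu: int, local_rank: int) -> int:
    # Ranks are numbered machine by machine: the main machine (machine_rank 0)
    # owns ranks 0 .. num_gpu - 1, the next machine owns the following block, etc.
    return machine_rank * num_gpu + local_rank

# With --num-machine 2 and --num-gpu 2: world size 4, and the main
# machine's two processes get ranks 0 and 1.
assert world_size(2, 2) == 4
assert [global_rank(0, 2, r) for r in range(2)] == [0, 1]
```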
If you want to use a fixed port number, just specify the --dist-port argument.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0 --dist-port xxxxx
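When --dist-port is not given, an unused port is picked automatically on the main machine. A common way to do that is to bind a socket to port 0 and let the OS choose; this is a sketch of that approach, not necessarily how this repo selects its default:

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any currently unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(find_free_port())  # prints some free port number
```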
On the other machines, specify --machine-rank accordingly and set the --dist-ip and --dist-port arguments to the same values used on the main machine.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 1 --dist-ip xxx.xxx.xxx.xxx --dist-port xxxxx
Below are examples of collective communication functions and training on a single machine. They can also be executed in multi-machine settings.
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.gather --num-gpu 2
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.reduce --num-gpu 2
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.loader --num-gpu 2 --seed 0
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.train --num-gpu 2 --seed 0
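As a rough picture of what the gather and reduce examples exercise, the sketch below runs all_gather and all_reduce across two processes. It uses the gloo backend and a local address so it runs without GPUs; it is an illustration of the collectives, not the actual src.tools.gather / src.tools.reduce code.

```python
# Sketch of the two basic collectives the examples demonstrate.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",  # gloo so this sketch runs on CPU
        init_method="tcp://127.0.0.1:29501",  # assumed local rendezvous address
        world_size=world_size,
        rank=rank,
    )
    x = torch.tensor([float(rank)])

    # all_gather: every rank receives the tensors from all ranks.
    gathered = [torch.zeros_like(x) for _ in range(world_size)]
    dist.all_gather(gathered, x)

    # all_reduce: every rank receives the element-wise sum over all ranks.
    summed = x.clone()
    dist.all_reduce(summed, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: gathered={[t.item() for t in gathered]}, sum={summed.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```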