Commit 3e65394

Author: Choi TaeHo
Update README.md
1 parent 6b2d4a9 commit 3e65394

README.md

Lines changed: 21 additions & 13 deletions
<p align="center"><img width="40%" src="./img/pytorch.png"></p>

# Multi Machine Tutorial for Pytorch
It uses TCP communication for multi-GPU processing, automatically finding an unused port and the main machine's IP address.

## Requirements
* pytorch : 1.5
* CUDA : 10.1

## Multi GPU in Single Machine
For multi-GPU processing on a single machine, you only need to specify the `num-gpu` argument.
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2
```
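
As an illustration of what a launcher like `check_dist` might do under the hood (a minimal sketch assuming the usual `torch.distributed` pattern; the function names here are hypothetical, not the repository's actual code), it can ask the OS for an unused port and spawn one process per GPU over TCP:

```python
# Minimal sketch: pick an unused port, then spawn one worker per GPU.
import socket

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def find_free_port() -> int:
    # Binding to port 0 makes the OS assign an unused port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def worker(local_rank: int, num_gpu: int, port: int) -> None:
    # Each spawned process joins the group over TCP and drives one GPU.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://127.0.0.1:{port}",
        world_size=num_gpu,
        rank=local_rank,
    )
    torch.cuda.set_device(local_rank)
    dist.barrier()  # simple sanity check that all ranks connected
    dist.destroy_process_group()


if __name__ == "__main__":
    num_gpu = 2
    port = find_free_port()
    mp.spawn(worker, args=(num_gpu, port), nprocs=num_gpu)
```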

## Multi Machines
### Main Machine
[For collective communication](https://tutorials.pytorch.kr/intermediate/dist_tuto.html#collective-communication) in PyTorch, a process must first be launched on the main machine.
The main machine's IP address and an unused port number are set automatically for TCP communication.
Set `num-machine` to the total number of machines, and set `machine-rank` to zero.
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0
```

If you want to use a fixed port number, specify the `dist-port` argument.
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0 --dist-port xxxxx
```
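
As a rough sketch of how these arguments typically combine (an assumption based on the usual convention for such launchers, not code taken from this repository), the total world size and each process's global rank would be:

```python
# Hypothetical rank arithmetic for --num-gpu / --num-machine / --machine-rank.
num_gpu = 2       # GPUs per machine
num_machine = 2   # machines in total
machine_rank = 0  # 0 on the main machine, 1..num_machine-1 elsewhere

world_size = num_machine * num_gpu  # 4 processes across both machines
for local_rank in range(num_gpu):
    global_rank = machine_rank * num_gpu + local_rank
    print(f"machine {machine_rank}, GPU {local_rank} -> global rank {global_rank}")
```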

### Other Machines
On the other machines, set `machine-rank` to a value in the range 1 to (num-machine - 1).
You must also set the `dist-ip` and `dist-port` arguments to the same values used on the main machine.

```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 1 --dist-ip xxx.xxx.xxx.xxx --dist-port xxxxx
```

## Test
The following examples demonstrate collective communication functions on a single machine. They can also be executed in multi-machine settings.

### Gather
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.gather --num-gpu 2
```
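
For reference, here is a minimal sketch of the kind of gather test such a script could run (illustrative only; the actual `src.tools.gather` may differ). It assumes the process group has already been initialized, e.g. by a launcher like the one sketched above:

```python
# Hypothetical gather test: every rank contributes one tensor and
# receives the tensors from all ranks via the all_gather collective.
import torch
import torch.distributed as dist


def run_gather() -> None:
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # One GPU per rank on a single machine is assumed here.
    local = torch.full((1,), float(rank), device=f"cuda:{rank}")
    gathered = [torch.zeros_like(local) for _ in range(world_size)]

    # After the call, every rank holds the tensors from all ranks.
    dist.all_gather(gathered, local)
    print(f"rank {rank} gathered {[t.item() for t in gathered]}")
```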
