<p align="center"><img width="40%" src="./img/pytorch.png"></p>

# Multi Machine Tutorial for Pytorch
This tutorial uses TCP communication for multi-GPU processing. It automatically finds the main machine's IP address and an unused port for the TCP connection.

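The port discovery itself is not shown in this snippet; a common way to find an unused port is to bind a socket to port 0 and let the OS choose one. The sketch below only illustrates that idea, and the helper name `find_free_port` is hypothetical, not part of this repository.
```python
import socket

def find_free_port() -> int:
    """Hypothetical helper: bind to port 0 so the OS assigns a free TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))            # port 0 -> the OS picks an unused port
        return s.getsockname()[1]  # report the port that was chosen

if __name__ == "__main__":
    print(find_free_port())
```
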
## Requirements
* pytorch : 1.5
* CUDA : 10.1

## Multi GPU in Single Machine
For multi-GPU processing on a single machine, you only need to specify the `num-gpu` argument.
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2
```

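`src.tools.check_dist` itself is not shown in this snippet; a launcher of this kind typically spawns one worker process per visible GPU. The standalone sketch below shows that pattern under that assumption; the `worker` function is hypothetical.
```python
import torch
import torch.multiprocessing as mp

def worker(local_rank: int, num_gpu: int):
    # Hypothetical per-GPU worker: each spawned process drives one GPU.
    torch.cuda.set_device(local_rank)
    print(f"worker {local_rank}/{num_gpu} bound to GPU {torch.cuda.current_device()}")

if __name__ == "__main__":
    # With CUDA_VISIBLE_DEVICES='0,1', device_count() reports 2.
    num_gpu = torch.cuda.device_count()
    mp.spawn(worker, args=(num_gpu,), nprocs=num_gpu)
```
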
## Multi Machines
### Main Machine
[For collective communication](https://tutorials.pytorch.kr/intermediate/dist_tuto.html#collective-communication) in pytorch, a process must first be launched on the main machine.
It automatically sets the main machine's IP address and an unused port number for TCP communication.
Set `num-machine` to the total number of machines and set `machine-rank` to zero.
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0
```

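How these arguments reach `torch.distributed` is not shown in this snippet; a typical launcher combines them into a TCP `init_method` and derives a global rank for each process, roughly as sketched below. The function name `init_distributed` and the exact rank formula are assumptions, not verified repository code.
```python
import torch.distributed as dist

def init_distributed(dist_ip: str, dist_port: int, num_gpu: int,
                     num_machine: int, machine_rank: int, local_rank: int):
    # Assumed bookkeeping: the world spans every GPU on every machine,
    # and each process's global rank follows from its machine and GPU index.
    world_size = num_machine * num_gpu
    global_rank = machine_rank * num_gpu + local_rank
    dist.init_process_group(
        backend="nccl",                              # GPU backend
        init_method=f"tcp://{dist_ip}:{dist_port}",  # TCP rendezvous point
        world_size=world_size,
        rank=global_rank,
    )
```
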
When you want to use a fixed port number, just specify the `dist-port` argument.
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 0 --dist-port xxxxx
```

### Other Machines
On the other machines, set `machine-rank` to a value in the range 1~(num_machine-1).
You must also set the `dist-ip` and `dist-port` arguments to the same values used on the main machine.

```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.check_dist --num-gpu 2 --num-machine 2 --machine-rank 1 --dist-ip xxx.xxx.xxx.xxx --dist-port xxxxx
```

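As a concrete illustration of the rank layout assumed in the sketch above: with `--num-gpu 2 --num-machine 2`, the main machine's processes would take global ranks 0 and 1, and the machine with `machine-rank` 1 would take ranks 2 and 3. This follows the assumed formula, not verified repository behavior.
```python
# Assumed rank layout for --num-gpu 2 --num-machine 2 (hypothetical).
num_gpu, num_machine = 2, 2
for machine_rank in range(num_machine):
    for local_rank in range(num_gpu):
        print(f"machine {machine_rank}, gpu {local_rank} -> "
              f"global rank {machine_rank * num_gpu + local_rank}")
# machine 0 -> ranks 0, 1; machine 1 -> ranks 2, 3
```
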
## Test
These examples exercise the collective communication functions on a single machine.
They can also be executed in multi-machine settings.
### Gather
```bash
CUDA_VISIBLE_DEVICES='0,1' python -m src.tools.gather --num-gpu 2
```
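
The internals of `src.tools.gather` are not shown in this snippet; the sketch below demonstrates the same collective independently with `torch.distributed.all_gather`, using the CPU `gloo` backend and assuming port 29500 is free.
```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int):
    # Each process joins the group over TCP on localhost.
    dist.init_process_group(
        backend="gloo",  # CPU backend; "nccl" would be used for GPUs
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    tensor = torch.tensor([rank])
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)  # every rank receives every rank's value
    print(f"rank {rank} gathered {[t.item() for t in gathered]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```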