Hi, thanks for your nice work.
I ran the training script provided in this repo without changing any code, but there is a significant performance gap between this code and your paper (for example, 0.588 vs. 0.715 AUROC on Tiny ImageNet).
Should I tune some of the training hyperparameters to close this gap? I have already tried adjusting the learning rate and the number of epochs (roughly as sketched below), but it did not help. I am looking forward to your suggestions. The full command, config, and training log follow.
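For reference, this is roughly how I swept the learning rate and epoch count. It is only a minimal sketch: the entry-point name and flag names (`--lr`, `--epochs`, `--dataset`) are my assumptions based on the config keys printed below, not the repo's confirmed CLI.

```python
import itertools
import subprocess

# Hypothetical sweep; replace "main.py" and the flag names with the
# repo's actual entry point and CLI arguments.
for lr, epochs in itertools.product([5e-5, 1e-4, 2e-4], [100, 200]):
    subprocess.run(
        ["python", "main.py",
         "--dataset", "tiny_imagenet",
         "--lr", str(lr),
         "--epochs", str(epochs)],
        check=True,
    )
```

None of these settings changed the test AUROC meaningfully.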
(cvaecaposr) xx@xxx:~/cvaecaposr$ sh ./scripts/train_tinyimagenet.sh
{
"data_base_path": "./data",
"val_ratio": 0.2,
"seed": 1234,
"known_classes": [
2,
3,
13,
30,
44,
45,
64,
66,
76,
101,
111,
121,
128,
130,
136,
158,
167,
170,
187,
193
],
"unknown_classes": [
0,
1,
4,
5,
6,
7,
8,
9,
10,
11,
12,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
31,
32,
33,
34,
35,
36,
37,
38,
39,
40,
41,
42,
43,
46,
47,
48,
49,
50,
51,
52,
53,
54,
55,
56,
57,
58,
59,
60,
61,
62,
63,
65,
67,
68,
69,
70,
71,
72,
73,
74,
75,
77,
78,
79,
80,
81,
82,
83,
84,
85,
86,
87,
88,
89,
90,
91,
92,
93,
94,
95,
96,
97,
98,
99,
100,
102,
103,
104,
105,
106,
107,
108,
109,
110,
112,
113,
114,
115,
116,
117,
118,
119,
120,
122,
123,
124,
125,
126,
127,
129,
131,
132,
133,
134,
135,
137,
138,
139,
140,
141,
142,
143,
144,
145,
146,
147,
148,
149,
150,
151,
152,
153,
154,
155,
156,
157,
159,
160,
161,
162,
163,
164,
165,
166,
168,
169,
171,
172,
173,
174,
175,
176,
177,
178,
179,
180,
181,
182,
183,
184,
185,
186,
188,
189,
190,
191,
192,
194,
195,
196,
197,
198,
199
],
"split_num": 0,
"batch_size": 32,
"num_workers": 0,
"dataset": "tiny_imagenet",
"z_dim": 128,
"lr": 5e-05,
"t_mu_shift": 10.0,
"t_var_scale": 0.01,
"alpha": 1.0,
"beta": 0.01,
"margin": 10.0,
"in_dim_caps": 16,
"out_dim_caps": 32,
"checkpoint": "",
"mode": "train",
"epochs": 100
}
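One aside on this config: `num_workers` is 0, which is what triggers the dataloader warnings further down. As far as I know this only affects throughput, not the final AUROC, but for completeness this is the standard PyTorch way to raise it (a toy dataset stands in for the repo's Tiny ImageNet loader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy 64x64 tensors standing in for the Tiny ImageNet dataset.
dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                        torch.randint(0, 20, (256,)))

# num_workers > 0 spawns worker processes for loading; this changes
# training speed only, not accuracy.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=8)
```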
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
| Name | Type | Params
--------------------------------------
0 | enc | ResNet34 | 21.3 M
1 | vae_cap | VaeCap | 23.5 M
2 | fc | Linear | 10.5 M
3 | dec | Decoder | 760 K
4 | t_mean | Embedding | 51.2 K
5 | t_var | Embedding | 51.2 K
--------------------------------------
56.1 M Trainable params
0 Non-trainable params
56.1 M Total params
224.552 Total estimated model params size (MB)
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Epoch 20: 100% 312/313 [00:38<00:00, 8.21it/s, loss=4.99e+03, v_num=0, train_acc=0.938, validation_acc=0.456]
Epoch 21: reducing learning rate of group 0 to 2.5000e-05.
Epoch 28: 100% 312/313 [00:37<00:00, 8.35it/s, loss=3.31e+03, v_num=0, train_acc=0.906, validation_acc=0.460]
Epoch 29: reducing learning rate of group 0 to 1.2500e-05.
Epoch 42: 100% 312/313 [00:38<00:00, 8.20it/s, loss=2.32e+03, v_num=0, train_acc=0.906, validation_acc=0.459]
Epoch 43: reducing learning rate of group 0 to 6.2500e-06.
Epoch 48: 100% 312/313 [00:37<00:00, 8.35it/s, loss=1.88e+03, v_num=0, train_acc=0.969, validation_acc=0.474]
Epoch 49: reducing learning rate of group 0 to 3.1250e-06.
Epoch 54: 100% 312/313 [00:37<00:00, 8.25it/s, loss=2.41e+03, v_num=0, train_acc=0.938, validation_acc=0.465]
Epoch 55: reducing learning rate of group 0 to 1.5625e-06.
Epoch 60: 100% 312/313 [00:37<00:00, 8.35it/s, loss=1.72e+03, v_num=0, train_acc=1.000, validation_acc=0.468]
Epoch 61: reducing learning rate of group 0 to 7.8125e-07.
Epoch 66: 100% 312/313 [00:37<00:00, 8.35it/s, loss=1.88e+03, v_num=0, train_acc=1.000, validation_acc=0.471]
Epoch 67: reducing learning rate of group 0 to 3.9063e-07.
Epoch 72: 100% 312/313 [00:38<00:00, 8.21it/s, loss=1.62e+03, v_num=0, train_acc=1.000, validation_acc=0.466]
Epoch 73: reducing learning rate of group 0 to 1.9531e-07.
Epoch 78: 100% 312/313 [00:37<00:00, 8.35it/s, loss=1.15e+03, v_num=0, train_acc=1.000, validation_acc=0.472]
Epoch 79: reducing learning rate of group 0 to 9.7656e-08.
Epoch 84: 100% 312/313 [00:40<00:00, 7.73it/s, loss=1.48e+03, v_num=0, train_acc=0.969, validation_acc=0.470]
Epoch 85: reducing learning rate of group 0 to 4.8828e-08.
Epoch 90: 100% 312/313 [00:38<00:00, 8.04it/s, loss=1.68e+03, v_num=0, train_acc=0.938, validation_acc=0.472]
Epoch 91: reducing learning rate of group 0 to 2.4414e-08.
Epoch 96: 100% 312/313 [00:38<00:00, 8.09it/s, loss=1.82e+03, v_num=0, train_acc=0.938, validation_acc=0.471]
Epoch 97: reducing learning rate of group 0 to 1.2207e-08.
Epoch 99: 100% 313/313 [00:38<00:00, 8.10it/s, loss=1.76e+03, v_num=0, train_acc=1.000, validation_acc=0.468]
Saving latest checkpoint...
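One pattern I notice in the log above: the learning rate is halved on every plateau (5e-05 → 2.5e-05 → ... → 1.2207e-08 by epoch 97) while validation accuracy stays flat around 0.47. The "reducing learning rate of group 0" message is the one printed by PyTorch's ReduceLROnPlateau, so I assume something like the sketch below is running; only factor=0.5 is evident from the log, the patience value is my guess.

```python
import torch

# Minimal sketch of the scheduler behavior visible in the log above.
# mode="max" tracks validation accuracy; factor=0.5 matches the halving
# seen in the log; patience=5 is an assumption.
model = torch.nn.Linear(8, 2)
opt = torch.optim.Adam(model.parameters(), lr=5e-5)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="max", factor=0.5, patience=5, verbose=True
)

for epoch in range(100):
    val_acc = 0.47  # plateaued metric, as in my run
    sched.step(val_acc)  # halves the LR each time val_acc stops improving
```

Since the LR has already decayed to ~1e-08 well before epoch 100, simply raising `epochs` cannot change the result, which may explain why my tuning attempts had no effect.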
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 80 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Testing: 100% 312/313 [00:23<00:00, 13.07it/s]
/xx/anaconda3/envs/cvaecaposr/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: Metric `AUROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
Testing: 100% 313/313 [00:23<00:00, 13.10it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_auroc': 0.5880855321884155}
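For anyone comparing against the paper: as I understand it, `test_auroc` is the standard binary AUROC for known-vs-unknown detection in the open-set protocol. A toy illustration of the metric (hypothetical scores, not the repo's code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# is_unknown: 1 for samples drawn from the unknown classes, 0 for known.
# scores: the model's per-sample "unknownness" score.
is_unknown = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.2, 0.8, 0.3, 0.9])
print(roc_auc_score(is_unknown, scores))  # ~0.89 on this toy data
```

An AUROC of 0.588 is only slightly above chance (0.5), which is why the gap to the reported 0.715 concerns me.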