Sunday, January 20, 2019

RTX Titan FP16 vs. FP32 training comparison

Environment
-Baseline: the PyTorch CIFAR-10 example [1], cross-entropy loss, DPN architecture (a minimal training sketch follows after the notes below)
-Training only; no test/validation pass
-Initialized from scratch, 1 epoch only
-With data augmentation
-Intel i7 CPU, 32 GB system memory, Windows 10
-Anaconda environment, PyTorch 1.0, Python 3.7, CUDA 10, cuDNN 7.4.2 [3]

* The RTX Titan has 24 GB of memory, but the video display shares it and occupies about 3 GB,
so roughly 21 GB can be used for DNN training.
* The Turing architecture (RTX series) is not supported by CUDA 9.x.
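
For reference, a minimal sketch of the baseline setup described above (FP32, cross-entropy loss, DPN from the CIFAR-10 example [1], standard augmentation, one training epoch, no test pass). The DPN92 import path, learning rate, and normalization constants are illustrative assumptions, not values taken from the original run.
-----------------------------------------------
import datetime
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from models import DPN92      # DPN implementation assumed importable as in [1]

batch_size = 128              # varied below: 128 / 256 / 512

# standard CIFAR-10 augmentation (crop + flip) and commonly used normalization stats
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=0)

device = 'cuda'
net = DPN92().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

print(datetime.datetime.now())
net.train()
train_loss, correct, total = 0.0, 0, 0
for batch_idx, (inputs, targets) in enumerate(trainloader):   # 1 epoch only
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    train_loss += loss.item()
    _, predicted = outputs.max(1)
    total += targets.size(0)
    correct += predicted.eq(targets).sum().item()
    if batch_idx % 50 == 0:
        print(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
              % (train_loss / (batch_idx + 1), 100. * correct / total, correct, total))
print(datetime.datetime.now())
-----------------------------------------------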


(1) Baseline (FP32) training with varying batch_size
batch_size=128
--------------------------------
Epoch: 0
2019-01-21 11:15:03.973084
0 391 Loss: 2.436 | Acc: 6.250% (8/128)
50 391 Loss: 2.351 | Acc: 14.982% (978/6528)
100 391 Loss: 2.168 | Acc: 20.707% (2677/12928)
150 391 Loss: 2.065 | Acc: 23.996% (4638/19328)
200 391 Loss: 1.988 | Acc: 26.877% (6915/25728)
250 391 Loss: 1.922 | Acc: 29.451% (9462/32128)
300 391 Loss: 1.867 | Acc: 31.452% (12118/38528)
350 391 Loss: 1.817 | Acc: 33.224% (14927/44928)
2019-01-21 11:18:36.301463 ==>  3min 33sec.

batch_size=256
----------------------
Epoch: 0
2019-01-21 11:21:24.807364
0 196 Loss: 2.315 | Acc: 10.156% (26/256)
50 196 Loss: 2.201 | Acc: 17.831% (2328/13056)
100 196 Loss: 2.018 | Acc: 24.760% (6402/25856)
150 196 Loss: 1.905 | Acc: 29.134% (11262/38656)
2019-01-21 11:24:48.638777 ==> 3min 24sec.


batch_size=512: GPU out-of-memory error!
----------------------------------



(2) FP16 training of the baseline
-Vary batch_size with static_loss_scale fixed at 128.0 [2] (see the sketch below)
-Check GPU memory usage as batch_size increases
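
A minimal sketch of how the baseline is switched to FP16 with a fixed static loss scale using apex [2]; it reuses the data pipeline from the FP32 sketch above. The network_to_half / FP16_Optimizer calls assume apex's fp16_utils API at the time and may differ in other apex versions; the peak-memory readout is just one way to check GPU usage.
-----------------------------------------------
import torch
import torch.nn as nn
import torch.optim as optim
from apex.fp16_utils import network_to_half, FP16_Optimizer   # assumed apex API [2]
from models import DPN92                                      # same DPN as the FP32 baseline [1]

device = 'cuda'
net = network_to_half(DPN92().to(device))    # FP16 weights/activations, BatchNorm kept in FP32
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.0)   # fixed scale, as in (2)

net.train()
for inputs, targets in trainloader:          # trainloader built as in the FP32 sketch above
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs.float(), targets)   # compute the loss in FP32
    optimizer.backward(loss)                 # scales the loss by 128, then backpropagates
    optimizer.step()                         # unscales gradients, updates FP32 master weights

# one way to read peak GPU memory after the epoch (bytes -> GB)
print('peak GPU memory: %.2f GB' % (torch.cuda.max_memory_allocated() / 1024**3))
-----------------------------------------------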

batch_size=128
--------------------------------
Epoch: 0
2019-01-21 11:10:35.342605
0 391 Loss: 2.371 | Acc: 7.812% (10/128)
50 391 Loss: 2.421 | Acc: 13.480% (880/6528)
100 391 Loss: 2.242 | Acc: 18.162% (2348/12928)
150 391 Loss: 2.115 | Acc: 21.797% (4213/19328)
200 391 Loss: 2.029 | Acc: 24.662% (6345/25728)
250 391 Loss: 1.951 | Acc: 27.596% (8866/32128)
300 391 Loss: 1.891 | Acc: 29.893% (11517/38528)
350 391 Loss: 1.831 | Acc: 31.987% (14371/44928)
2019-01-21 11:13:24.088977 ==> 2min 49sec

batch_size=256 ==> 13 GB of GPU memory occupied
--------------------------------
Epoch: 0
2019-01-21 11:37:30.438707
0 196 Loss: 2.406 | Acc: 7.031% (18/256)
50 196 Loss: 2.273 | Acc: 15.709% (2051/13056)
100 196 Loss: 2.078 | Acc: 22.625% (5850/25856)
150 196 Loss: 1.947 | Acc: 27.491% (10627/38656)
2019-01-21 11:39:59.184161 ==> 2min 30sec

batch_size=512 ==> 21.87 GB of GPU memory occupied
--------------------------------
Epoch: 0
2019-01-21 11:27:54.565118
0 98 Loss: 2.329 | Acc: 13.086% (67/512)
50 98 Loss: 2.182 | Acc: 17.601% (4596/26112)
2019-01-21 11:30:14.626127 ==> 2min 20sec


batch_size=512+128 (=640) ==> GPU out-of-memory error!
--------------------------------


(3) Varying static_loss_scale with batch_size fixed at 512
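
For context, static loss scaling multiplies the loss by a constant scale S before the backward pass so that small FP16 gradients do not underflow to zero, and the gradients are divided by S again before the weight update; as long as no overflow occurs, the choice of S should mainly affect numerical robustness, not the training result. A self-contained toy sketch of the idea (illustrative only, not the apex internals):
-----------------------------------------------
import torch
import torch.nn as nn

# manual static loss scaling on a toy FP32 model, to illustrate what
# static_loss_scale does: scale the loss before backward, unscale the
# gradients before the optimizer step
scale = 256.0                                 # e.g. 64.0 / 128.0 / 256.0 as tested here
net = nn.Linear(10, 2)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))

opt.zero_grad()
loss = nn.functional.cross_entropy(net(x), y)
(loss * scale).backward()                     # gradients are now scaled by `scale`
for p in net.parameters():
    p.grad.div_(scale)                        # unscale, so the update itself is unchanged
opt.step()
-----------------------------------------------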

static_loss_scale=64.0
-----------------------------------------------
Epoch: 0
2019-01-21 11:45:43.910621
0 98 Loss: 2.337 | Acc: 11.914% (61/512)
50 98 Loss: 2.167 | Acc: 18.624% (4863/26112)
2019-01-21 11:48:03.947331 ==> 2min 20sec


static_loss_scale=256.0
-------------------------------------------
Epoch: 0
2019-01-21 11:49:50.158676
0 98 Loss: 2.413 | Acc: 8.789% (45/512)
50 98 Loss: 2.224 | Acc: 16.885% (4409/26112)
2019-01-21 11:52:10.582739 ==> 2min 20sec



References
1. PyTorch CIFAR-10 example
2. NVIDIA Apex
3. cuDNN Support Matrix