Code notes on quantization-aware training

1. Approaches to neural network model quantization

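post-training quantization

    Concept: The model is trained in floating point, and quantization is then applied to the resulting weight values.

    Pros and cons: The accuracy drop is small for large models with many parameters, but the approach is not well suited to small models with few parameters.

quantization-aware training

    Concept: During training, the effect that quantization will have at inference time is simulated (modeled) in advance.

    Pros and cons: The accuracy drop of the quantized model can be minimized.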
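To make the post-training case concrete, the following is a minimal sketch of quantizing already-trained floating point weights to 8-bit integers and mapping them back. The affine (scale and zero-point) scheme, the function names `quantize_weights` and `dequantize_weights`, and the sample weight values are assumptions chosen for illustration; the notes do not specify a particular scheme.

```python
def quantize_weights(weights, num_bits=8):
    """Affine (asymmetric) quantization of float weights to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    # Fall back to 1.0 when all weights are zero, to avoid dividing by zero.
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    zero_point = round(qmin - w_min / scale)
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_weights(q, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [scale * (v - zero_point) for v in q]


# Hypothetical weights from a finished floating point training run.
trained = [-0.51, -0.02, 0.0, 0.27, 1.02]
q, scale, zp = quantize_weights(trained)
restored = dequantize_weights(q, scale, zp)
# The rounding error per weight is what post-training quantization has to absorb.
max_error = max(abs(a - b) for a, b in zip(trained, restored))
```

The difference between `trained` and `restored` is the error the model never saw during training, which is why small models with little redundancy tend to suffer more from this approach.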

Quantization-aware training simulates the quantization of weights and activation outputs in the forward / backward pass.

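Fake quantization nodes are added to the graph to simulate the effect of quantization: the forward pass models the quantization itself, while the backward pass treats the node as a straight-through estimator.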
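As a rough illustration of what a fake quantization node does, the sketch below quantizes a value and immediately dequantizes it in the forward pass, and passes the gradient through unchanged in the backward pass (a straight-through estimator). The names `fake_quant` and `fake_quant_grad` are hypothetical, and the sketch is simplified: it works on a single scalar and omits details such as zero-point nudging that a real implementation may apply.

```python
def fake_quant(x, x_min, x_max, num_bits=8):
    """Forward pass: quantize then dequantize, so the result is a float on the quantization grid."""
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    # Values outside the representable range saturate at the range limits.
    clamped = min(x_max, max(x_min, x))
    return x_min + round((clamped - x_min) / scale) * scale


def fake_quant_grad(x, x_min, x_max, upstream_grad):
    """Backward pass: pass the gradient through inside [x_min, x_max], block it outside."""
    return upstream_grad if x_min <= x <= x_max else 0.0


y = fake_quant(0.1234, -1.0, 1.0)               # float close to 0.1234, snapped to the grid
y_saturated = fake_quant(5.0, -1.0, 1.0)        # clipped to the upper limit
g_inside = fake_quant_grad(0.3, -1.0, 1.0, 0.7)
g_outside = fake_quant_grad(5.0, -1.0, 1.0, 0.7)
```

Because the output stays in floating point, the rest of the training graph runs unchanged while still experiencing the rounding and clipping error that inference will introduce.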

During quantization-aware training, the actual output range (min/max) of each activation is also observed, so a separate calibration step can be skipped.
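One way to picture this range observation is a small tracker that is updated with every training batch. The sketch below is an assumption for illustration only: the class name `RangeTracker`, the exponential moving average, and the decay value are not taken from the notes, which state only that the min/max range is observed during training.

```python
class RangeTracker:
    """Tracks a running (min, max) of activation values with an exponential moving average."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.min = None
        self.max = None

    def update(self, batch_activations):
        batch_min = min(batch_activations)
        batch_max = max(batch_activations)
        if self.min is None:
            # The first batch initializes the range directly.
            self.min, self.max = batch_min, batch_max
        else:
            d = self.decay
            self.min = d * self.min + (1 - d) * batch_min
            self.max = d * self.max + (1 - d) * batch_max
        return self.min, self.max


# Hypothetical activation outputs from three training batches.
tracker = RangeTracker(decay=0.9)
for batch in ([0.0, 2.0, 5.5], [0.0, 1.0, 6.5], [0.0, 3.0, 6.0]):
    tracker.update(batch)
```

At the end of training, `tracker.min` and `tracker.max` already hold the range needed to quantize that activation, which is the information a calibration pass would otherwise have to collect.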

Batch normalization is folded into the preceding layer at inference time (which simplifies the graph), so the simulation is carried out with this folding applied.
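The folding mentioned above can be checked with a small numeric sketch. It assumes a single-channel linear layer followed by batch normalization with fixed statistics; the function names and parameter values are hypothetical. The point is that the folded weight and bias reproduce the unfolded output, so it is the folded weight that would be fake-quantized during training.

```python
import math


def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta into w_fold*x + b_fold."""
    factor = gamma / math.sqrt(var + eps)
    w_fold = w * factor
    b_fold = beta + (b - mean) * factor
    return w_fold, b_fold


def layer_with_bn(x, w, b, gamma, beta, mean, var, eps=1e-3):
    """Reference computation: linear layer followed by batch normalization."""
    return gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta


# Hypothetical layer parameters and batch normalization statistics.
params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
w_fold, b_fold = fold_batch_norm(**params)

x = 2.0
unfolded = layer_with_bn(x, **params)
folded = w_fold * x + b_fold  # single multiply-add, as used at inference
```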

2. Creating the fake-quantized training graph

Adding fake quantization nodes one by one at every required position in a model is difficult, so a function that rewrites the training graph for quantization simulation is used.

input_graph: The tf.Graph to be transformed.
quant_delay: Number of steps after which weights and activations are quantized during training.
If one wants to train a quantized model from scratch, quant_delay should be set to the number of steps it takes the floating point model to converge.
Quantization will be activated at this point and effectively fine-tune the model. If quant_delay is not provided when training from scratch, training can often fail.
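The parameters above appear to match `tf.contrib.quantize.create_training_graph(input_graph=..., quant_delay=...)` from TensorFlow 1.x, although the notes do not name the function. The effect of quant_delay can be sketched in plain Python as a gate on the training step. The function `maybe_fake_quant`, its fixed range, and the step values are hypothetical and only illustrate the behavior described: floating point training first, simulated quantization afterwards.

```python
def maybe_fake_quant(x, step, quant_delay, x_min=-1.0, x_max=1.0, num_bits=8):
    """Return x unchanged before quant_delay steps, and its fake-quantized value afterwards."""
    if step < quant_delay:
        # Floating point phase: the model trains without quantization effects.
        return x
    # Quantized phase: quantize then dequantize so the value lands on the grid.
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    clamped = min(x_max, max(x_min, x))
    return x_min + round((clamped - x_min) / scale) * scale


before = maybe_fake_quant(0.1234, step=10, quant_delay=100)   # still exact
after = maybe_fake_quant(0.1234, step=100, quant_delay=100)   # snapped to the grid
```

Setting quant_delay to roughly the number of steps the floating point model needs to converge means the quantized phase starts from good weights and acts as fine-tuning, which is the situation the description above recommends.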