Directory
- 1. Quick overview of necessary steps
- 2. Explain each step in detail
  - 2.1 Floating-point model preparation
  - 2.2 Data calibration
  - 2.3 Quantization training
  - 2.4 Fixed-point conversion
  - 2.5 Model compilation
- 3. Common Problems
Anyone choosing the QAT solution presumably already has some basic grasp of "quantization". Generally speaking, it is most recommended to try PTQ (post-training quantization) first: compared with QAT, it is easier to get started with, costs less, and is non-intrusive to the floating-point model code. The accuracy of the Horizon PTQ solution is at an industry-leading level, and for most common vision tasks the quantization loss can be controlled within 1%. However, no matter how good an offline quantization algorithm is, it cannot guarantee that the quantized accuracy of every model meets business needs. In addition, PTQ uses only a small amount of calibration data; even if the accuracy metrics on the validation set look similar, the generalization of the model may still suffer from the information loss that quantization inevitably introduces. Quantization-aware training (QAT) treats the quantization error as noise during training, so the model continuously learns to adapt to this noise, and the resulting parameters end up more robust to quantization.

Enough theory; how do we quickly cover the "last mile" from a floating-point model to a QAT model? Since PyTorch 1.8, torch.fx ([official documentation](https://pytorch.org/docs/stable/fx.html#)) can automatically trace a model's forward process, which greatly reduces the difficulty of using QAT. The corresponding quantization scheme, FX Graph Mode Quantization, is far more automated than Eager Mode Quantization and correspondingly less laborious to operate. However, the user needs to adjust the model so that it is "symbolically traceable". Let's look at the specific steps below:
1. Quick overview of necessary steps
The whole QAT solution consists of five steps from floating-point model to deployment model: floating-point model preparation, data calibration, quantization training (optional), fixed-point conversion, and model compilation. The necessary steps and sample code are shown below; for detailed instructions and precautions on each step, refer to the following sections. For a complete example, refer to the fx_mode.py script in the /ddk/samples/ai_toolchain/horizon_model_train_sample/plugin_basic directory of the OE development package. **It is highly recommended to skip the training process and run through the prepare -> convert -> check steps before quantization training (even during the floating-point model design phase) to make sure the model is supported by the hardware.**
```python
import copy

import torch

from horizon_plugin_pytorch.quantization import (
    convert_fx,
    prepare_qat_fx,
    set_fake_quantize,
    FakeQuantState,
    check_model,
    compile_model,
)
from horizon_plugin_pytorch.quantization.qconfig import (
    default_calib_8bit_fake_quant_qconfig,
    default_qat_8bit_fake_quant_qconfig,
    default_calib_8bit_weight_32bit_out_fake_quant_qconfig,
    default_qat_8bit_weight_32bit_out_fake_quant_qconfig,
)
from horizon_plugin_pytorch.march import March, set_march

# 1. Floating-point model preparation
set_march(March.BAYES)
float_model = load_float_model(pretrain=True)

# 2. Data calibration
calib_model = prepare_qat_fx(
    copy.deepcopy(float_model),
    {
        "": default_calib_8bit_fake_quant_qconfig,
        "module_name": {
            "classifier": default_calib_8bit_weight_32bit_out_fake_quant_qconfig,
        },
    },
)
calib_model.eval()
set_fake_quantize(calib_model, FakeQuantState.CALIBRATION)
calibrate(calib_model)
calib_model.eval()
set_fake_quantize(calib_model, FakeQuantState.VALIDATION)
evaluate(calib_model)
torch.save(calib_model.state_dict(), "calib-checkpoint.ckpt")

# 3. Quantization training (optional)
qat_model = prepare_qat_fx(
    copy.deepcopy(float_model),
    {
        "": default_qat_8bit_fake_quant_qconfig,
        "module_name": {
            "classifier": default_qat_8bit_weight_32bit_out_fake_quant_qconfig,
        },
    },
)
qat_model.load_state_dict(calib_model.state_dict())
qat_model.train()
set_fake_quantize(qat_model, FakeQuantState.QAT)
train(qat_model)
qat_model.eval()
set_fake_quantize(qat_model, FakeQuantState.VALIDATION)
evaluate(qat_model)

# 4. Fixed-point conversion
base_model = qat_model  # use calib_model if calibration accuracy already meets expectations
quantized_model = convert_fx(base_model)
evaluate(quantized_model)

# 5. Model compilation
script_model = torch.jit.trace(quantized_model.cpu(), example_input)
check_model(script_model, [example_input])
compile_model(
    script_model,
    [example_input],
    hbm="model.hbm",
    input_source="pyramid",
    opt=3,
)
```
2. Explain each step in detail
2.1 Floating-point model preparation
a. Before quantization training, train the floating-point model normally with sufficient data until it converges.
b. It is strongly recommended to normalize the input data; this helps floating-point convergence and makes the model more quantization-friendly.
c. Check the supported-operators list during the design phase of the floating-point model, to avoid prepare_qat or compilation errors caused by unsupported operators.
d. If CPU operators are used in the model and you want to compile them into the model, refer to user manual section 4.2.4.4 "Heterogeneous Model Guide" for conversion and compilation.
e. For more instructions on building a quantization-friendly model, refer to user manual section 4.2.4.1 "Requirements for Floating-point Models".
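As a rough illustration of point b, the sketch below shows channel-wise input normalization, `(x - mean) / std`, in plain Python. The statistics here are the commonly used ImageNet values and are only an example; use the statistics of your own dataset in practice.

```python
def normalize(pixels, mean, std):
    """Channel-wise normalization: (x - mean) / std per channel.

    `pixels` holds the per-channel values of one pixel; `mean` and `std`
    are per-channel dataset statistics (illustrative values below).
    """
    return [(x - m) / s for x, m, s in zip(pixels, mean, std)]

# Example ImageNet-style statistics for a 3-channel (RGB) input
mean = [123.675, 116.28, 103.53]
std = [58.395, 57.12, 57.375]

print(normalize(mean, mean, std))  # -> [0.0, 0.0, 0.0]
```

Normalized inputs keep activation ranges compact and symmetric, which is what makes them easier to represent on a low-bit quantization grid.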
Although fx mode is less intrusive to the original floating-point model code than eager mode, some necessary modifications to the floating-point model are still required to support the subsequent quantization operations.
- Insert a QuantStub node before the model input and a DequantStub node after the model output. Note the following:
  - Multiple inputs can share one QuantStub only if they have the same scale; otherwise, define a separate QuantStub for each input.
  - It is recommended to use horizon_plugin_pytorch.quantization.QuantStub, which collects the input scale dynamically by default. If the scale can be computed ahead of time (for example, the homography matrix in BEV models), it is recommended to set the scale manually; note that the corresponding PyTorch interface torch.quantization.QuantStub does not support setting the scale manually.
- It is suggested that the parts of the model that do not need quantization, such as pre/post-processing and loss, not be written in the model's forward function, to avoid fake-quantization nodes being inserted there by mistake, which would affect model accuracy.
- Dynamic control flow and some Python built-in functions are operations that tracing does not support (see the [official description](https://pytorch.org/docs/2.0/fx.html#non-torch-functions)). Such code needs to be factored into a separate method and decorated with wrap; the recommended usage is as follows:

```python
from horizon_plugin_pytorch.utils.fx_helper import wrap as fx_wrap

@fx_wrap()  # wrapped methods are kept as leaf calls during symbolic tracing
def test(self, x):
    if self.training:  # dynamic control flow cannot be symbolically traced
        pass

def forward(self, x):
    ...
    x = self.test(x)
    return x
```
2.2 Data calibration
For some models, the required accuracy can be reached by calibration alone, without time-consuming quantization-aware training. Even when a model cannot meet the accuracy requirement after calibration, this process reduces the difficulty of the subsequent quantization-aware training, shortens the training time, and improves the final training accuracy. For the specific configuration and tuning suggestions of data calibration, refer to the [QAT solution calibration instructions](https://developer.horizon.cc/forumDetail/177840589839214596).
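To build intuition for what calibration does, here is a minimal pure-Python sketch (illustrative only, not the plugin's observer implementation) of how an observer can track an activation range with a moving average controlled by an `averaging_constant` c: `new_stat = (1 - c) * old_stat + c * batch_stat`. With c = 1.0 each batch fully replaces the statistic; with c = 0.0 the statistic is frozen.

```python
def update_range(old_min, old_max, batch, averaging_constant):
    """Moving-average min/max observer update (illustrative sketch)."""
    b_min, b_max = min(batch), max(batch)
    c = averaging_constant
    new_min = (1 - c) * old_min + c * b_min
    new_max = (1 - c) * old_max + c * b_max
    return new_min, new_max

# Start from the range of a first batch, then observe a second batch
print(update_range(-1.0, 1.0, [-2.0, 0.5, 4.0], averaging_constant=1.0))
# -> (-2.0, 4.0): the new batch fully replaces the statistic
print(update_range(-1.0, 1.0, [-2.0, 0.5, 4.0], averaging_constant=0.0))
# -> (-1.0, 1.0): the statistic is frozen
```

This is also the knob referenced by the `averaging_constant` row of the hyperparameter table in section 2.3.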
2.3 Quantization training
Some recommended hyperparameter configurations for quantization training are shown in the following table:
| Hyperparameter | Recommended configuration | Advanced configuration (try if the recommended one does not work) |
| --- | --- | --- |
| LR | StepLR starting from 0.001, with 2 lr decays of scale = 0.1 | 1. Adjust lr between 0.0001 and 0.001, matched with 1-2 lr decays. 2. The lr schedule can also be switched from StepLR to CosLR. 3. If QAT uses AMP, lower the lr appropriately; too large an lr leads to NaN. |
| Epoch | 10% of the floating-point epochs | 1. Depending on the convergence of the loss and metrics, consider whether the epochs need to be extended appropriately. |
| Weight decay | Consistent with floating point | 1. It is recommended to make appropriate adjustments around 4e-5. Too small a weight decay leads to excessively large weight variance, which is especially pronounced at the output layer of tasks with large outputs. |
| optimizer | Consistent with floating point | 1. If floating-point training uses an optimizer that interacts with the lr schedule, such as OneCycle, it is recommended not to stay consistent with floating point and to use SGD instead. |
| transforms (data augmentation) | Consistent with floating point | 1. The QAT stage can weaken augmentation appropriately, e.g. remove the color transforms for classification and reduce the scale range of RandomResizeCrop. |
| averaging_constant (qconfig_params) | 1. When calibration is used, it is recommended to disable activation updates: weight averaging_constant = 1.0, activation averaging_constant = 0.0 | 1. If the calibration result differs greatly from floating point, do not set the activation averaging_constant to 0.0. 2. The weight averaging_constant generally does not need to be set to 0.0 and can be adjusted within (0, 1.0). |
It is highly recommended to try data calibration first, and move on to quantization training only if the accuracy does not meet expectations (be careful to load the weight parameters obtained from data calibration). Tuning suggestions for the quantization training phase can be found in the user manual section [Quantization training accuracy tuning recommendations](https://developer.horizon.cc/api/v1/fileData/horizon_j5_open_explorer_cn_doc/plugin/source/user_guide/debug_precision.html#a-name-recommended-configuration-a).
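The recommended lr schedule from the table above can be sketched in plain Python as follows. The milestone epochs (4 and 8 in a hypothetical 10-epoch QAT run) are illustrative, not prescribed; only the starting lr of 0.001 and the two decays with scale 0.1 come from the recommendation.

```python
def step_lr(epoch, base_lr=0.001, gamma=0.1, milestones=(4, 8)):
    """StepLR-style schedule: multiply lr by `gamma` at each milestone.

    base_lr=0.001 and gamma=0.1 with two decays match the table's
    recommendation; the milestone epochs are hypothetical.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

print([round(step_lr(e), 5) for e in range(10)])
# -> [0.001, 0.001, 0.001, 0.001, 0.0001, 0.0001, 0.0001, 0.0001, 1e-05, 1e-05]
```

In real training this would typically be expressed with `torch.optim.lr_scheduler.StepLR` (or `MultiStepLR`) instead of a hand-rolled function.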
2.4 Fixed-point conversion
**Please note that the fixed-point model is not numerically identical to the pseudo-quantized model, so always report the accuracy of the fixed-point model. If the fixed-point accuracy does not meet the standard, continue quantization training; it is recommended to keep the qat model weights of several epochs, to make it easier to find the checkpoint with the best fixed-point accuracy. (High qat or calibration accuracy does not necessarily mean high fixed-point accuracy, so consider backtracking a few checkpoints to balance the final fixed-point accuracy.)**

Under normal circumstances, the accuracy of the fixed-point model is exactly the same as that of the model deployed on the board, so the fixed-point model can be used to evaluate the final deployment accuracy.
2.5 Model compilation
The model compilation phase consists of the following three steps:
```python
# 1. Trace the fixed-point model into TorchScript
script_model = torch.jit.trace(quantized_model, example_input)
# 2. Check that every operator in the model is supported by the hardware
check_model(script_model.cpu(), [example_input])
# 3. Compile to an hbm file deployable on the board
compile_model(
    script_model,
    [example_input],
    hbm="model.hbm",
    input_source="pyramid",
    opt=3,
)
```
3. Common Problems
**1. Why use deepcopy before prepare?**
A: The prepare_qat_fx and convert_fx interfaces do not support an inplace parameter, and the input and output models of these two interfaces share almost all attributes. It is therefore recommended to use deepcopy to make sure the original input model is not changed. If you do not need to keep the input model and skip the deepcopy, do not make any further changes to the input model.

**2. Why set high-precision output?**
A: As introduced in the background of neural network quantization, the activation values produced by the multiply-accumulate units are int32 and are requantized to int8/int16 so that the next layer's op can continue the computation. Therefore, if the last layer is a conv/linear node, it is recommended to configure high-precision output so the model outputs int32 directly, which is of great benefit to accuracy.
In addition, high-precision output can also be configured by setting `model.classifier.qconfig = default_qat_8bit_weight_32bit_out_fake_quant_qconfig` before prepare_qat_fx; this way of setting the module's qconfig attribute has a higher priority than the qconfig dict passed to prepare.
For plugin versions ≤ v1.6.2, use default_calib_out_8bit_fake_quant_qconfig to configure high-precision output; however, this qconfig will be deprecated in later versions.
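To see why skipping the final requantization helps accuracy, here is a rough pure-Python sketch (illustrative arithmetic, not the toolchain's implementation; all numbers are hypothetical) of requantizing an int32 accumulator value onto an int8 grid and the rounding loss this introduces:

```python
def requantize_to_int8(acc_int32, acc_scale, out_scale):
    """Rescale an int32 accumulator value to an int8 code (sketch).

    The real value represented by the accumulator is acc_int32 * acc_scale.
    Requantization re-expresses it on the coarser int8 grid `out_scale`,
    clamping to [-128, 127]; the rounding step is where precision is lost,
    which is why outputting int32 directly benefits the last conv/linear.
    """
    q = round(acc_int32 * acc_scale / out_scale)
    return max(-128, min(127, q))

acc = 12345                        # hypothetical int32 accumulator value
acc_scale, out_scale = 1e-4, 0.05  # hypothetical scales
q8 = requantize_to_int8(acc, acc_scale, out_scale)
print(q8, q8 * out_scale)  # -> 25 1.25, versus the true value 1.2345
```

With high-precision output configured, the last layer emits `acc_int32 * acc_scale` directly and the final rounding error above never occurs.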
**3. How to understand the states of fake quantize?**
Fake quantize has three states. Use set_fake_quantize to switch the model's fake-quantize nodes to the corresponding state before QAT, calibration, and validation respectively. In the CALIBRATION state, only the statistics of each operator's input and output are observed. In the QAT state, the pseudo-quantization operation is performed in addition to observing the statistics. In the VALIDATION state, statistics are no longer observed and only the pseudo-quantization operation is performed.
```python
class FakeQuantState(Enum):
    QAT = "qat"
    CALIBRATION = "calibration"
    VALIDATION = "validation"
```
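The pseudo-quantization operation applied in the QAT and VALIDATION states can be sketched in plain Python as quantize-then-dequantize on the int8 grid (an illustrative sketch, not the plugin's implementation): the value stays floating point but carries the rounding and clamping error of int8 quantization, which during QAT acts as the training noise the model learns to tolerate.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize-then-dequantize on the int8 grid (illustrative sketch).

    The result is still a float, but only values on the grid
    {qmin*scale, ..., qmax*scale} are representable.
    """
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

scale = 0.1
print(fake_quantize(0.234, scale))  # -> 0.2 (rounded onto the grid)
print(fake_quantize(99.0, scale))   # clamped to the int8 max code, 127 * scale
```

In the CALIBRATION state this operation is skipped and only the observers run, which is why calibration leaves the forward numerics identical to floating point.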