QAT Accuracy Tuning Recommendations

D-Robotics · December 27, 2023, 6:51am

1. Recommended accuracy tuning process

The QAT solution takes a model from floating point to deployment in five steps: floating-point model preparation, data calibration, quantization training (optional), fixed-point conversion, and model compilation. Each stage has its own accuracy verification and consistency alignment requirements; in particular, when on-board inference results are inconsistent with the Python side, Horizon provides tools to help with the analysis. For the specific procedure, refer to [**QAT consistent alignment process**](https://developer.horizon.cc/forumDetail/185446132959372305). When accuracy loss occurs during data calibration, quantization training, or fixed-point conversion, it is recommended to tune according to the following process:

Since the quantization tool uses symmetric uniform quantization, some optimizations can be applied while building and training the floating-point model to make it more quantization-friendly. For details, see **Building Quantization-Friendly Models** in the user manual. The rest of this article covers accuracy analysis and tuning methods for calibration, QAT, and fixed-point conversion.

2. Accuracy tuning suggestions for the calibration stage

Calibration is recommended before quantization training. On the one hand, calibration is fast, and some models meet their accuracy requirements with calibration alone, which avoids the time-consuming quantization training process. On the other hand, calibration initializes the quantization parameters, which also speeds up convergence of the subsequent quantization training.

The 80% calibration/float accuracy ratio used as the threshold below is an empirical value for reference only (some models with a calibration/float accuracy ratio below 80% can still reach the expected accuracy through QAT training).

2.1 Calibration accuracy loss < 20%

If calibration with the default qconfig gives only a small accuracy loss, it is recommended to refer to Chapters 2.2 and 4 of [**QAT solution Calibration instructions**](https://developer.horizon.cc/forumDetail/177840589839214596) and tune the hyperparameters, calibration algorithm, and dataset. If a small loss remains after tuning, quantization training is recommended.

2.2 Calibration loss ≥ 20%

1. Check whether the accuracy of the floating-point model is normal (whether it converged normally, whether it is overfitting, whether the weights were loaded correctly, and so on);
2. If the calibration loss is large with the default configuration, or a large loss remains after trying the calibration suggestions in the previous section, refer to Chapter 5 below and use the analysis tools to locate the layers with abnormal quantization.
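The branching in this chapter can be summarized as a small helper. This is only a sketch: the function name and the returned strings are hypothetical, and the 0.8 threshold is the empirical value quoted above rather than a hard rule.

```python
def next_calibration_step(float_acc: float, calib_acc: float, threshold: float = 0.8) -> str:
    """Suggest the next tuning step from the calibration/float accuracy ratio.

    `threshold` is the empirical 80% ratio from the text, not a hard rule.
    """
    if float_acc <= 0:
        raise ValueError("float_acc must be positive")
    ratio = calib_acc / float_acc
    if ratio > threshold:
        # Loss < 20%: tune the calibration settings first, then fall back to QAT.
        return "tune calibration (hyperparameters, algorithm, data), then QAT if needed"
    # Loss >= 20%: verify the float model, then locate abnormal layers.
    return "check the float model, then locate abnormal layers with the analysis tools"


print(next_calibration_step(float_acc=76.0, calib_acc=74.5))
print(next_calibration_step(float_acc=76.0, calib_acc=40.0))
```

Because the threshold is only empirical, a model that falls into the second branch may still be worth trying with QAT once the floating-point model has been verified.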

3. Accuracy tuning suggestions for the QAT training stage

- **Quantization parameter initialization**: Running calibration first is recommended by default, since it gives QAT training better accuracy and faster convergence. If the calibration accuracy loss is small, refer to Chapter 3 of [**QAT solution Calibration instructions**](https://developer.horizon.cc/forumDetail/177840589839214596) and fix the activation scale by setting the activation averaging_constant=0.0.
- **Transform (data augmentation)**: By default, keep it consistent with floating-point training. It can also be weakened appropriately; for example, color transforms can be removed for classification, and the scale range of RandomResizeCrop can be narrowed.
- **Optimizer**: By default, keep it consistent with floating-point training, but SGD is also worth trying. If floating-point training uses something like OneCycle that affects the LR settings, it is recommended not to follow the floating-point setup and to use SGD instead.
- **Exception handling**:
  - **NaN**:
    1. Check whether the accuracy of the floating-point model is normal;
    2. Check the data and labels for nan and inf;
    3. Lower the learning rate, or use a warmup strategy;
    4. Clip gradients with torch.nn.utils.clip_grad_norm_.
  - **Abnormal loss**:
    1. Check whether the calibration parameters are loaded correctly.
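Two of the NaN checks above can be illustrated without any framework. The sketch below uses plain Python lists in place of tensors; `has_non_finite` and `clip_by_global_norm` are hypothetical helper names. The clipping mirrors the global L2-norm rule that `torch.nn.utils.clip_grad_norm_(parameters, max_norm)` applies in place to parameter gradients during real training.

```python
import math


def has_non_finite(values):
    """Return True if any value is NaN or +/-inf (check data and labels with this)."""
    return any(not math.isfinite(v) for v in values)


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so that their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only scale down; gradients that are already small are left unchanged.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm


print(has_non_finite([0.1, float("nan"), 2.0]))
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

In a training loop the clipping call would sit between `loss.backward()` and `optimizer.step()`.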

If the quantization accuracy still cannot be improved effectively after debugging along the lines above, it is recommended to refer to Chapter 5 below and use the analysis tools to locate the layers with abnormal quantization.

4. Suggestions for tuning fixed-point accuracy

1. First, it is recommended to follow [**QAT consistent alignment process**](https://developer.horizon.cc/forumDetail/185446132959372305) to rule out errors introduced by pre-processing and post-processing;
2. Check whether the model output is abnormal because of incorrect modifications made after loading the checkpoint;
3. If you convert directly from calibration to fixed point, and the calibration accuracy loss is small but the fixed-point accuracy loss is large, confirm whether the accuracy measured at the calibration stage was wrong (usually the fake-quantize node state was set incorrectly, so the calibration-stage test actually measured floating-point accuracy). It is recommended to refer to common problem 3 in Chapter 6 of [**QAT solution Calibration instructions**](https://developer.horizon.cc/forumDetail/177840589839214596) and set the model state correctly with set_fake_quantize;
4. If the fixed-point accuracy loss after converting directly from calibration is not too large, you can continue with QAT and try the QAT models from different epochs to find the one with the best fixed-point accuracy;
5. If none of the above brings a clear benefit, it is recommended to refer to Chapter 5 below and use the analysis tools to locate the layers with abnormal quantization.

5. Analysis Tool User Guide

The QAT solution provides a profiler toolkit, `horizon-plugin-profiler`, which is pre-installed in the GPU docker of the toolchain. For a local installation, the whl package can be obtained from the OE development package under `OE/ddk/package/host/ai_toolchain`. For how to use the toolkit and how to configure the parameters of each interface, refer to **User Manual: Analysis tools use guide** (https://developer.horizon.cc/api/v1/fileData/horizon_j5_open_explorer_cn_doc/plugin/source/user_guide/debug html#plugin-source-user-guide-debug-tools–page-root). This article introduces only a few commonly used accuracy tuning interfaces (fuse checking, shared op checking, quantization configuration checking, similarity comparison, statistics, step quantization, and single-operator conversion accuracy debugging).

**Notes:**

- Since the QAT phase changes the model weights, comparing the floating-point or calibration model against the QAT model is not recommended.
- Use real data that reproduces the accuracy problem as input to the analysis interfaces; otherwise the analysis results may not reflect the actual behavior (for example, a similarity of 0 or even a negative value).
- Before using the analysis tools, call the set_fake_quantize interface to set the calibration model to the VALIDATION state, so that the fake-quantize nodes take effect and the analysis results are not distorted.
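To see how a similarity of 0 or a negative value can arise, the sketch below assumes the similarity metric is cosine similarity between the flattened outputs of a layer in the two models. The source does not name the metric, so this is an assumption, and `cosine_similarity` is a hypothetical helper rather than a toolkit API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized flat vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


float_out = [0.52, -1.10, 2.03]
print(cosine_similarity(float_out, [0.50, -1.09, 2.00]))  # well-matched outputs
print(cosine_similarity(float_out, [0.0, 0.0, 0.0]))      # one side collapsed to zero
print(cosine_similarity(float_out, [-0.5, 1.0, -2.0]))    # sign-flipped outputs
```

Inputs that do not exercise the model realistically (random noise, all-zero tensors) tend to produce near-zero or unstable feature maps, which is one way such degenerate similarities can appear.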

For first-time users, it is recommended to use the integration interface to check similarity, statistics, shared op, fuse pattern, and quantization configuration in one go, as follows:

```python
import copy

# Import paths follow the horizon_plugin_pytorch layout used in this article;
# check them against the installed toolchain version.
from horizon_plugin_profiler import model_profiler
from horizon_plugin_pytorch.quantization import (
    FakeQuantState,
    convert_fx,
    prepare_qat_fx,
    set_fake_quantize,
)
from horizon_plugin_pytorch.quantization.qconfig import (
    default_calib_8bit_fake_quant_qconfig,
)

# load_float_model, calibrate, calib_dataloader, device, data0 and data1 are user-defined
float_model = load_float_model(pretrain=True)

calib_model = prepare_qat_fx(
    copy.deepcopy(float_model), {"": default_calib_8bit_fake_quant_qconfig}
)
calib_model.eval()
# Set the fake-quantize state. Do not call .eval() again after setting CALIBRATION,
# otherwise the scales may not be updated
set_fake_quantize(calib_model, FakeQuantState.CALIBRATION)
# Run inference on a suitable amount of data selected from the training set
calibrate(calib_model, calib_dataloader, device)

# Before using any analysis tool, set the model state to VALIDATION so that the
# fake-quantize nodes work normally
set_fake_quantize(calib_model, FakeQuantState.VALIDATION)
# Example with a two-input model
model_profiler(float_model, calib_model, (data0, data1), mode="FvsQ", out_dir="float_calib")

calib_model2 = copy.deepcopy(calib_model)
quantized_net = convert_fx(calib_model2)

model_profiler(calib_model, quantized_net, (data0, data1), mode="QvsQ", out_dir="calib_quantized")
```

You are advised to pay attention to the following results:
## 5.1 Shared op
### 5.1.1 Possible problems
- A shared op may prevent part of the model structure from being fused
- A shared op may force multiple call sites to share the same quantization parameters, causing accuracy problems
### 5.1.2 Analysis Methods
> Note: Run the check on the floating-point model (the calibration and QAT models have already been fused, so checks tied to the original model structure cannot be performed on them)

**Method 1**: Call the integrated interface `model_profiler(mode="FvsQ")` and open profiler.html; any shared op is marked in red:

**Method 2**: Call the shared op checking interface get_module_called_count and look for ops whose called times is greater than 1 in the printed result.

**Method 3**: Look at the floating-point model statistics or the similarity analysis result file (a shared op may not show up there if it has been fused). As shown in the figure below, an op whose Module Name field is followed by a serial number (num) is a shared op.

### 5.1.3 Optimization Suggestions
Define each op/block separately (except for DeQuantStub nodes) rather than defining it once and calling it multiple times.
> For special designs such as a shared conv, you can make two copies of the op and load the same parameters into both to avoid abnormal fusion.

## 5.2 Incorrect fusion
### 5.2.1 Possible problems
- Significant loss of quantization accuracy (operators that are not fused are computed independently, so each gets its own quantization parameters)
- Higher deployment latency (operator fusion reduces the number of ops in the deployed model and speeds up computation)
### 5.2.2 Analysis Methods
> Note: Run the check on the floating-point model (the calibration and QAT models have already been fused, so checks tied to the original model structure cannot be performed on them)

Look at any similarity analysis result, or at the statistics of any model other than the original floating-point model. Standalone bn, relu, or floatfunctional.add nodes suggest that some structures were not fused properly.

Combined with the check result of check_unfused_operations, you can view the fusion suggestion of the operator:

> This article was tested on J5 OE1.1.68. In this version, the fuse check result shown in profiler.html by the integrated interface `model_profiler(mode="FvsQ")` may miss some cases; this will be fixed in a later version.

### 5.2.3 Optimization Suggestions
Refer to [**the range of supported operator fusion**](https://developer.horizon.cc/api/v1/fileData/horizon_j5_open_explorer_cn_doc/plugin/source/advanced_content/op_fusion.html?highlight=%E7%AE%97%E5%AD%90%E8%9E%8D%E5%90%88%20%E7%AE%97%E5%AD%90%20%E8%9E%8D%E5%90%88) in the user manual and fuse as much of the model as possible.

If you use fx mode, operator fusion is completed automatically during prepare. In general, unfused structures appear only when an op is shared or wrapped abnormally. Remove the sharing as recommended in the preceding section and adjust the wrap range so that every structure that can be fused is fused.
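As background on why fusion both removes ops and avoids an extra set of quantization parameters, the sketch below folds a batch-norm into the preceding convolution using the standard folding identity, reduced to a single scalar channel. The source does not describe how the toolchain implements fusion internally, so this is only an illustration of the general idea; both function names are hypothetical.

```python
import math


def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Unfused path: a scalar 'conv' followed by batch-norm in inference mode."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn_into_conv(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fused path: return the weight and bias of a single equivalent conv."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


params = dict(weight=0.7, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
fused_w, fused_b = fold_bn_into_conv(**params)
for x in (-2.0, 0.0, 3.5):
    print(conv_then_bn(x, **params), fused_w * x + fused_b)
```

After folding there is one op with one output to quantize, whereas the unfused path would quantize the intermediate conv output as well.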
## 5.3 Insufficient data resolution
Because the default qconfig for both calibration and qat is an int8 configuration, the data representation range is limited, so using the default configuration may cause a significant loss of precision in some layers.
### 5.3.1 Analysis Methods
#### 5.3.1.1 Model Input
Check the maximum and minimum values of the QuantStub node in the floating-point model statistics (profile_featuremap & get_raw_features). If the values fall outside [-1, 1], the corresponding input node has not been normalized.

In some special scenarios the input node has a clear physical meaning (such as the grid input of gridsample), and the input really does need to be a series of integers:

In that case, first determine whether the value range fits int8 or int16 (the figure above obviously exceeds the representable range of int8), configure the corresponding qconfig, and then manually set the scale of the input node (scale=max/128.0 for int8 or scale=max/32768.0 for int16).
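The scale arithmetic can be sketched in plain Python. `choose_scale` and `fake_quantize` are hypothetical helpers that apply symmetric quantization with the scale=max/2^(bits-1) rule quoted above; the value range of 512 is an assumed example, not a number from the source.

```python
def choose_scale(max_abs: float, bits: int) -> float:
    """scale = max / 128.0 for int8, max / 32768.0 for int16."""
    return max_abs / float(1 << (bits - 1))


def fake_quantize(x: float, scale: float, bits: int) -> float:
    """Quantize to a signed integer code, clamp, and map back to float."""
    qmin = -(1 << (bits - 1))
    qmax = (1 << (bits - 1)) - 1
    code = max(qmin, min(qmax, round(x / scale)))
    return code * scale


max_abs = 512.0  # assumed largest absolute coordinate in the grid input
coord = 301.0    # an integer coordinate that should survive quantization

for bits in (8, 16):
    scale = choose_scale(max_abs, bits)
    print(bits, scale, fake_quantize(coord, scale, bits))
```

With int8 the step between representable values exceeds 1, so integer coordinates cannot all be represented, while the int16 step is small enough to keep them exact.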

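For the gridsample case, the scale can be derived and configured as follows:

```python
import math

import horizon_plugin_pytorch.nn as hnn
from horizon_plugin_pytorch.quantization import QuantStub
from horizon_plugin_pytorch.quantization.qconfig import (
    default_calib_16bit_fake_quant_qconfig,
)
from torch import nn

data_shape = (3, 512, 960)
grid_size = (128, 128)


def get_grid_quant_scale(grid_shape, view_shape):
    # Number of integer bits needed for the largest coordinate
    max_coord = max(*grid_shape, *view_shape)
    coord_bit_num = math.ceil(math.log(max_coord + 1, 2))
    # The remaining int16 magnitude bits (at most 8) go to the fractional part
    coord_shift = 15 - coord_bit_num
    coord_shift = max(min(coord_shift, 8), 0)
    grid_quant_scale = 1.0 / (1 << coord_shift)
    return grid_quant_scale


view_shape = [data_shape[1] / 16, data_shape[2] / 16]
grid_quant_scale = get_grid_quant_scale(grid_size, view_shape)


class ViewTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # Fixed scale for the grid input instead of a calibrated one
        self.quant_stub = QuantStub(grid_quant_scale)
        self.grid_sample = hnn.GridSample(
            mode="bilinear",
            padding_mode="zeros",
        )

    def forward(self, feats, points):
        # The original snippet indexed points[i] inside a loop that the excerpt
        # omits; a single grid tensor is assumed here
        trans_feat = self.grid_sample(feats, self.quant_stub(points))
        return trans_feat

    def set_qconfig(self) -> None:
        # The grid input needs int16 resolution
        self.quant_stub.qconfig = default_calib_16bit_fake_quant_qconfig
```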
#### 5.3.1.2 Model Output

If the output layer of the model is conv/linear, it is recommended to enable high-precision output to preserve accuracy (for the underlying principle, see the first common problem in [**QAT quick-and-dirty**](https://developer.horizon.cc/forumDetail/191823403479697531)). There are several ways to confirm that high-precision output is correctly enabled on the output layer; pick whichever is most convenient:

- Print the calibration/QAT model directly. If high-precision output is enabled, the corresponding conv/linear node has no `(activation_post_process): FakeQuantize` field (for a more detailed example, see item 6 of [**QAT FAQ**](https://developer.horizon.cc/api/v1/fileData/faq_toolchain/faq_source/qat_faq.html));
- Check the last column of the statistics result file. If high-precision output is enabled, the output type is float32;
- Check the qconfig check result file. If high-precision output is enabled, the output type is float32;
- Check whether the DeQuantStub node has a scale in the similarity result. No scale means high-precision output is enabled; a scale means it is disabled.

Note that if high-precision output is configured for a conv/linear that is not an output layer, the calibration and QAT phases fail with `AttributeError: 'NoneType' object has no attribute 'numel'`.

#### 5.3.1.3 Model Middle Layer

As with the model input, check the similarity result (whether qscale matches the value range) and the statistics (whether the maximum and minimum values exceed the int8 range) to determine whether the quantization loss comes from insufficient numerical resolution at a node.

### 5.3.2 Optimization Suggestions

#### 5.3.2.1 Model Input

There are generally two types of model input: raw data (images, radar, etc.) and auxiliary inputs to the model (such as transformer position encodings). Both must be quantized before they can feed the quantized network. Since the quantization tool uses symmetric uniform quantization, the following improvements are suggested:

- Normalize the input data symmetrically around 0 during data preprocessing;
- Check whether the quantization configuration is reasonable. For example, a fixed quantization scale=1/128.0 is recommended for image input (if the data is not normalized to [-1, 1], set scale=max/128.0, where max is the maximum absolute value over the input dataset). A fixed scale may not suit all data, however, so case-by-case analysis is required;
- If the data requires relatively high resolution and cannot be adjusted, int16 quantization is recommended.

Taking image input as an example: the original image (RGB or YUV) has an input range of [0, 255], which is not suitable for symmetric quantization. After zero-symmetric normalization the input range becomes [-1, 1], which can be quantized directly with a fixed scale=1/128.0.
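The image example can be checked numerically. The sketch below counts how many distinct int8 codes the 256 possible pixel values occupy with and without zero-symmetric normalization. The normalization `(p - 128) / 128` is one assumed choice that maps pixels into roughly [-1, 1]; the source does not prescribe a specific formula, and `quantize_int8` is a hypothetical helper.

```python
def quantize_int8(x: float, scale: float) -> int:
    """Symmetric int8 quantization: round to the nearest code and clamp."""
    return max(-128, min(127, round(x / scale)))


pixels = range(256)

# Raw [0, 255] input with scale = max / 128.0: the negative codes are never used
raw_scale = 255 / 128.0
raw_codes = {quantize_int8(float(p), raw_scale) for p in pixels}

# Zero-symmetric input with the fixed scale = 1 / 128.0
norm_scale = 1 / 128.0
norm_codes = {quantize_int8((p - 128) / 128.0, norm_scale) for p in pixels}

print(len(raw_codes), len(norm_codes))
```

Without normalization, pairs of neighboring pixel values collapse onto the same code, whereas the normalized input uses the full int8 range and keeps every pixel value distinct.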

#### 5.3.2.2 Model Output

Model outputs often have physical meaning and may require relatively high resolution, which makes them unsuitable for int8 quantization. It is recommended to:

- Leave the output unquantized. Currently, when a conv2d is the network output, unquantized output is supported by setting the qconfig of the output node to `default_calib_8bit_weight_32bit_out_fake_quant_qconfig`;
- If the output must be quantized for reasons such as BPU performance, quantize it with int16, or lower the required output resolution by adjusting the physical meaning of the output.

#### 5.3.2.3 Model Middle Layer

From an implementation perspective there are two kinds of operators: (1) single-granularity operators, such as conv2d; (2) complex operators built from several small operators, such as layernorm. The focus here is on the output of the operator as a whole, ignoring the outputs of the small operators inside a complex operator. If an operator outputs a large range of values, the following is recommended:

- Modify the model structure to limit the values to a certain range. Different schemes suit different operators, such as adding BN after conv2d or replacing relu with relu6;
- Quantize with int16;
- For patterns like conv-[bn]-[add]-relu, try specifying relu6 in the QAT phase (not guaranteed to help).

If the weights of a layer have a large range of values, you can:

- Try adjusting weight decay. It is recommended to adjust around 4e-5, neither too large nor too small. Too small a weight decay leads to excessive weight variance; too large a value may cause a chain reaction, such as excessive weight variance in the output layers of the network.

## 5.4 Abnormal single-operator loss

### 5.4.1 Analysis Methods

The symptom is a large loss on a single operator, with similarity dropping noticeably after quantization or after fixed-point conversion. For example, the cat node in the following figure is clearly abnormal; its input and output statistics show that the two inputs have different value ranges:

If the similarity and statistics results yield no useful information, it is recommended to use the step quantization tool (to find the abnormal quantized layer in the QAT stage) and the single-operator conversion accuracy debugging tool (to find the layer with abnormal fixed-point accuracy loss).

```python
# Step quantization: keep some ops in floating point during the QAT stage
# fx mode
qat_model = prepare_qat_fx(
    model,
    {"": default_qat_8bit_fake_quant_qconfig},
    hybrid=True,
    hybrid_dict={"module_name": ["conv1"]},
)
# eager mode
model.conv1.qconfig = None
qat_model = prepare_qat(model, hybrid=True)
```

```python
# Single-operator conversion accuracy debugging: keep some ops in the
# QAT/calibration state after fixed-point conversion.
# Configure after prepare_qat/prepare_qat_fx, using either of the two ways below.
# Way 1 (note the trailing comma: ("conv1") would be a plain string, not a tuple)
set_preserve_qat_mode(calib_net, ("conv1",), ())
# Way 2
calib_net.conv1.preserve_qat_mode = True
```

### 5.4.2 Optimization Suggestions

Several types of operators are prone to large quantization error:

- **Multi-input operators**, such as cat. If the input value ranges differ greatly, "large numbers eating small numbers" may occur (the cat output recomputes the scale, so information from the input with the small range is swamped), eventually causing abnormal accuracy. Suggested improvements:
  1. Limit the input ranges by whatever means are available so that the inputs have similar value ranges;
  2. Quantize with int16.
- **Nonlinear activation operators** (e.g. sigmoid, sqrt, log, reciprocal). These are implemented by table lookup underneath, and because the table has a limited number of entries, resolution may be insufficient where the output lies in a steep interval. Suggested improvements:
  1. Evaluate whether the operator can be used less, dropped, or replaced with another activation operator that does not enlarge the value range;
  2. Limit the input to a relatively flat interval;
  3. Quantize with int16;
  4. If QAT accuracy is normal but quantized accuracy is insufficient, it is recommended to seek Horizon technical support.
- **Complex operators**, such as layernorm and softmax. These are generally composed of several small operators, which may include the nonlinear activation operators mentioned above and therefore cause the same accuracy problems. Suggested improvements:
  1. Evaluate whether the operator can be omitted or replaced with another operator (for operators with higher accuracy risk, see the remarks in [**User Manual: the operator list**](https://developer.horizon.cc/api/v1/fileData/horizon_j5_open_explorer_cn_doc/plugin/source/appendix/operator.html));
  2. If QAT accuracy is normal but quantized accuracy is insufficient, try adjusting the table lookup parameters manually; layernorm and softmax, for example, support manual parameters.
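The "large numbers eating small numbers" effect on cat can be reproduced with a few numbers. In this sketch the two branches and their values are made up for illustration, and `cat_with_shared_scale` is a hypothetical stand-in for a cat op whose output scale is recomputed from the merged data.

```python
def quantize_dequantize(values, scale, bits):
    """Symmetric fake quantization of a list of floats."""
    qmin = -(1 << (bits - 1))
    qmax = (1 << (bits - 1)) - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def cat_with_shared_scale(branches, bits):
    """Concatenate branches and quantize them with one scale taken from the merged range."""
    merged = [v for branch in branches for v in branch]
    scale = max(abs(v) for v in merged) / float(1 << (bits - 1))
    return quantize_dequantize(merged, scale, bits), scale


small = [0.1, 0.3, -0.2]      # branch with a small value range
large = [100.0, -60.0, 25.0]  # branch with a large value range

for bits in (8, 16):
    out, scale = cat_with_shared_scale([small, large], bits)
    print(bits, scale, out[: len(small)])
```

With int8 the shared step is larger than every value in the small branch, so that branch collapses to zero; with int16 the step is fine enough to keep it. Bringing the two input ranges closer together achieves the same effect without the cost of int16.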