Active and passive quantization logic in the model conversion toolchain

Summary

This document introduces the internal operator-quantization logic of the Horizon model conversion toolchain, to help users understand how quantization is handled in some special scenarios.

Background and questions

When using the Horizon model conversion toolchain, users occasionally encounter the following problems:

After a model is successfully converted to a bin file, users sometimes find that some ops still run on the CPU, even though a careful check against the Horizon operator constraint list shows that the op satisfies the constraints and should, in theory, run on the BPU. Why is it still computed on the CPU? This question has puzzled many users. This article introduces the quantization principles and logic inside the model conversion toolchain, and presents several solutions.

When will an op run on the BPU

From the model conversion toolchain's point of view, whether an op is quantized and runs on the BPU depends on two things:

  1. Whether the op meets the specifications supported by the BPU;
  2. Whether the op can find a reasonable quantization threshold in its context.

For the first point, the operator constraint document gives a clear constraint range that can be queried directly and intuitively. The second point concerns quantization itself: running an op on the BPU involves a quantization step, so the op must be able to obtain a reasonable quantization threshold from the subgraph around it. This depends on the design of the calibration and quantization modules inside the conversion toolchain, which are transparent to the user. It is precisely this internal quantization logic that produces the "should run on the BPU but still runs on the CPU" problem described above. This article abstracts away the technical details of the toolchain's quantization logic and explains the scenarios in which a reasonable quantization threshold can be introduced.
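To make the notion of a "quantization threshold" concrete, the following is a minimal sketch of symmetric int8 quantization driven by a calibration threshold. This is an illustrative simplification, not the toolchain's actual implementation: the function names and the symmetric int8 scheme are assumptions for demonstration.

```python
import numpy as np

def quantize_int8(x: np.ndarray, threshold: float) -> np.ndarray:
    """Symmetric int8 quantization: the calibration threshold maps to 127."""
    scale = threshold / 127.0
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize_int8(q: np.ndarray, threshold: float) -> np.ndarray:
    """Recover an approximate float value from the int8 representation."""
    scale = threshold / 127.0
    return q.astype(np.float32) * scale

# A threshold close to the tensor's true dynamic range keeps the
# quantize/dequantize round-trip error small (bounded by half a scale step).
x = np.random.randn(1000).astype(np.float32)
threshold = float(np.abs(x).max())
x_hat = dequantize_int8(quantize_int8(x, threshold), threshold)
max_err = float(np.abs(x - x_hat).max())
```

A threshold that is too large wastes int8 resolution; one that is too small clips activations. Finding a reasonable threshold for every quantized op is exactly what the calibration module is responsible for.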

Classification of op quantization logic

For each op, the toolchain developers design its quantization-threshold logic according to the op's computational characteristics, the underlying logic of the BPU, the substructures in which the op appears in a large number of typical models, and the op's sensitivity to quantization. Based on an analysis of the ops already supported, the toolchain developers divide op quantization logic into the following types: **active quantization, passive quantization, manual quantization**.

Active quantization operators

Quantization logic features

Active quantization operators are operators that the toolchain tries to quantize whenever possible. Such operators have the following characteristics:

  1. The operator itself is usually computationally intensive, or is a computation type that the BPU is good at.
  2. A large number of experiments have verified that the quantization accuracy risk of this type of operator is small in most scenarios.

Typical representatives

  • Computationally intensive operators: conv/matmul/gemm/convtranspose, etc.
  • Activation and math operators: mish/hardswish/sin/cos, etc.
  • Elementwise operators: mul/add, etc.
  • Others: operators such as argmax/reduce

For active quantization operators, the conversion toolchain adjusts the threshold calibration position (among other methods) so that the thresholds produced by calibration statistics satisfy the operator's quantization requirements as far as possible, thereby ensuring that the operator is "quantized and run on the BPU whenever possible".

Passive quantization operators

A passive quantization operator is one that is only quantized passively, by "following" quantization, when the other operators in its context are also suitable to be quantized on the BPU. Put more simply (though not strictly), such an operator is quantized passively only when both the operator before it and the operator after it are active quantization operators.

Quantization logic features

Passive quantization logic is designed mainly with the accuracy and performance of such operators on the BPU in mind:

  • Performance perspective:
  1. These are non-computation-intensive operators; relatively speaking, the BPU is not as "good" at this kind of computation.
  2. When the substructure containing the operator is not supported by the BPU as a whole, individually quantizing the operator to run on the BPU would actually lower overall efficiency, because floating-point quantize/dequantize computation is introduced at the boundaries.
  • Accuracy perspective:
  1. Relatively speaking, this kind of operator carries a certain quantization accuracy risk.
  2. Due to the characteristics of the BPU, when such an operator is the last output node of a BPU segment, it is placed on the CPU for floating-point calculation to ensure higher accuracy (thereby keeping the BPU output as a high-precision 32-bit fixed-point output).
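The performance argument can be illustrated with a toy device-placement model. The sketch below simply counts the conversion (quantize/dequantize) nodes inserted at every CPU/BPU boundary; it is an illustration of the reasoning, not the toolchain's actual cost model.

```python
def count_conversions(placement):
    """Count quantize/dequantize nodes inserted at CPU<->BPU boundaries.

    `placement` is an ordered list of 'CPU'/'BPU' labels, one per op in a
    chain. Each device transition requires one conversion node.
    """
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)

# Quantizing a lone passive op between two CPU ops adds two conversions:
lone_quantized = count_conversions(['CPU', 'BPU', 'CPU'])

# Keeping it on the CPU adds none:
all_cpu = count_conversions(['CPU', 'CPU', 'CPU'])

# Inside a fully quantized segment (e.g. conv+concat+conv) it adds none either:
bpu_segment = count_conversions(['BPU', 'BPU', 'BPU'])
```

This is why an isolated passive operator is left on the CPU by default, while the same operator inside a BPU-friendly substructure is happily quantized along with its neighbors.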

Typical representative

  • Data movement and data manipulation operators: concat/slice/SpaceToDepth/gather/reshape/transpose, etc.
  • Pooling operators: averagepool/maxpool/globalaveragepool/globalmaxpool
  • Resize operators

Since passive quantization logic is the least intuitive, let us look at an example. Of the two substructures described below, the concat in the first is not quantized by default by the conversion toolchain, because it is not followed by an active quantization operator and concat itself is a passive quantization operator. The concat in the second substructure, however, has active quantization operators both before and after it, so it is also passively quantized, ensuring that the whole conv+concat+conv segment runs on the BPU.
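The two substructures can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction (module and layer names are invented for the example), not code taken from the toolchain documentation.

```python
import torch

class ConcatNotQuantized(torch.nn.Module):
    """Substructure 1: concat is the final node and is not followed by an
    active quantization operator, so it stays on the CPU by default."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(8, 8, 3, padding=1)

    def forward(self, a, b):
        # concat is the model output; nothing active follows it
        return torch.cat([self.conv(a), b], dim=1)

class ConcatPassivelyQuantized(torch.nn.Module):
    """Substructure 2: conv + concat + conv, so the concat is sandwiched
    between active quantization operators and follows quantization."""
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(8, 8, 3, padding=1)
        self.conv2 = torch.nn.Conv2d(8, 8, 3, padding=1)
        self.conv3 = torch.nn.Conv2d(16, 16, 3, padding=1)

    def forward(self, a, b):
        merged = torch.cat([self.conv1(a), self.conv2(b)], dim=1)
        return self.conv3(merged)  # active op after concat
```

In the second module, the concat can inherit a quantization threshold from the surrounding convs, so conv+concat+conv can be fused into one BPU segment.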

Manual quantization operators

Manual quantization operators are operators that the conversion toolchain does not quantize under the default configuration, regardless of whether they conform to the BPU constraints. The main reason is that toolchain developers found through experiments that the quantization accuracy risk of this kind of operator is relatively high in some scenarios; to guarantee accuracy, the operator is computed as float on the CPU by default. A typical example of this type of operator is softmax. At the same time, considering that different users have different models and use scenarios, the model conversion toolchain provides a runonbpu configuration in the yaml file: by specifying the name of a manually quantized operator under runonbpu, you can manually quantize that operator and make it run on the BPU.
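As a rough illustration, the yaml configuration might look like the fragment below. The section layout, key spelling, and node name here are assumptions for demonstration; consult the yaml template shipped with your toolchain version for the exact form.

```yaml
# Hypothetical fragment: force the softmax node named "softmax_1"
# (a manual quantization operator) to be quantized and run on the BPU.
run_on_bpu: "softmax_1"
```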

How to configure the passive quantization node to run on the BPU?

The quantization logic is designed this way as the greatest common divisor of many scenarios, balancing precision and performance. However, in a specific user's specific scenario, the user may have a better sense of the accuracy/performance trade-off. For this reason, the toolchain provides several ways to modify the quantization-threshold logic, so that passive quantization or manual quantization operators that run on the CPU by default can be configured to run on the BPU.

  • **runonbpu feature:** For some passive quantization operators and the manual quantization operators, when an operator is found to run on the CPU after conversion and the operator conforms to the BPU operator constraints, you can configure it to run on the BPU by declaring it via runonbpu. (Only some passive quantization operators currently support the runonbpu feature; subsequent toolchain versions will gradually extend support.)

  • **Insert a unit conv operator:** When building the model in the original training framework and exporting it, insert a unit conv operator before/after the passive quantization operator. Since conv is a typical active quantization operator, and a unit conv does not change its input or output (so no retraining is required), inserting it before or after a passive quantization operator ensures that the operator's context contains active quantization conv operators; the passive quantization operator then follows quantization passively (see the conv+concat structure above for a detailed example). A detailed description of the unit conv operator can be found in the toolchain community articles; the following is a typical unit conv written in the PyTorch framework:

    import numpy as np
    import torch


    def to_numpy(tensor: torch.Tensor) -> np.ndarray:
        return tensor.detach().cpu().numpy()


    class IdentityUnitConv(torch.nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.channels = channels
            # 1x1 depthwise conv initialized as the identity mapping
            self.identity_conv = torch.nn.Conv2d(
                channels, channels, 1, groups=channels, bias=False)
            torch.nn.init.dirac_(
                self.identity_conv.weight.data, groups=channels)
            self.check_equal()

        def check_equal(self):
            random_data = torch.randn(1, self.channels, 32, 32)
            result = self.forward(random_data)
            np.testing.assert_allclose(
                to_numpy(random_data), to_numpy(result),
                rtol=1e-02, atol=1e-03)
            print("check Identity, pass!")

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            """Identity with 4-D dataflow, input == output."""
            return self.identity_conv(x)

Summary

Based on the characteristics of the BPU chip, the computational characteristics of each operator, and experimental accuracy data, three kinds of internal quantization logic are designed in the model conversion toolchain. This article introduced the internal quantization logic of the Horizon model conversion toolchain: what active quantization, passive quantization, and manual quantization are, and how this logic shows up when using the toolchain, explaining why, after model conversion, there are occasionally individual operators that clearly comply with the operator support constraints but still run on the CPU. As the Horizon model conversion toolchain continues to improve, with ongoing algorithm optimization and BPU architecture updates, the quantization strategies of different ops will be adjusted dynamically.