Neural Network Quantization Background

1. Introduction

Over the past decade, the accuracy of neural network inference on tasks such as image classification, object detection, object tracking, and instance segmentation has improved significantly, often through highly over-parameterized models. This raises two questions: does continually increasing network depth keep improving accuracy on the classification task, and do more complex structures with more parameters yield better representations? Driven by these questions, convolutional neural networks went through a phase of increasingly complex structures and ever-larger parameter counts: the depth of the VGG network quickly grew to an astonishing 16 to 19 layers, and GoogLeNet reached 22 layers.

However, the excellent performance of neural networks comes at the cost of high computational complexity and huge memory consumption. The top half of the table above lists the model sizes of some commonly used networks, together with the computation and accuracy of inference on a standard 224 × 224 RGB image. As the table shows, networks have steadily grown in scale: parameter counts keep rising, and so does the amount of computation. This makes them difficult to deploy on resource-constrained devices and greatly limits their application scenarios. In the Internet of Things (IoT), for example, embedded systems not only lack rich computing and storage resources but also have urgent low-power requirements. Applications such as autonomous driving additionally place hard real-time demands on inference, and missing them creates safety risks. A variety of deep neural network compression and acceleration techniques have therefore emerged, all trying to remove redundancy from the network while preserving its accuracy, that is, to find a good trade-off between network performance and computing cost. In short, there is an urgent need for techniques that shrink models, lower power consumption, and speed up inference. Deep learning research addresses these needs mainly in two areas:

  • Design more efficient network architectures that reach acceptable accuracy at relatively small model sizes, such as MobileNetV1, MobileNetV2, and ShuffleNetV2 in the table above.

  • Reduce network scale through compression and coding. Quantization is one of the compression methods most widely used in industry. The two strategies can often be combined to impressive effect: for example, TensorFlow's quantized MobileNetV1 is just 4.8 MB, smaller than most GIFs, making it easy to deploy on almost any mobile platform.

2. Hardware background

Before diving into the technical details, let’s first look at the hardware background of neural network quantization and how it enables efficient on-device inference. A neural network is an algorithm that uses computers to simulate the way the biological nervous system processes information. Its basic unit is the artificial neuron; neurons connected to one another form a network. The following diagram shows the computational structure of a single neuron:

  • $x_1, x_2, \ldots, x_n$ denote the inputs;
  • $w_1, w_2, \ldots, w_n$ denote the weight parameters;
  • $b$ denotes the bias term.

In the computation graph of a neural network, each neuron computes a weighted sum of the form $\sum_{i=1}^{n} w_i x_i + b$, and the node’s output is obtained by applying a nonlinear activation; that output can in turn serve as the input to the next neuron. The basic operation of a neural network is therefore matrix multiplication, and hardware compute units improve inference efficiency by performing as many of these calculations in parallel as possible. The following diagram shows how a hardware accelerator computes the matrix multiplication $y = W\mathbf{x} + b$ for a single neuron:

A neural network accelerator consists of two basic components: processing elements $C_{n,m}$ and accumulators $A_n$. The accumulator is first loaded with the bias value $b_n$; the weights $W_{n,m}$ and inputs $\mathbf{x}_m$ are then loaded into the array, and their products are computed in the corresponding processing elements, $C_{n,m} = W_{n,m}\,\mathbf{x}_m$. Finally, the results are added into the accumulator $A_n$:

$$A_n = b_n + \sum_{m} C_{n,m}$$

The operation above is known as a multiply-accumulate (MAC). For larger matrix-vector multiplications, the hardware works as follows:

  1. MAC operations are repeated in a loop; once the loop completes, the values in the accumulators are moved back to memory for the next network layer.
  2. Neural networks are usually trained with FP32 weights and activations. If inference also runs in FP32, the processing elements and accumulators must support floating-point logic, and 32-bit data must be transferred from memory to the compute units.
  3. MAC operations and data transfers consume most of the energy spent during inference. Clear benefits can therefore be gained by using low-bit fixed-point (quantized) representations for these quantities. A low-bit fixed-point format such as INT8 not only reduces the amount of data transferred but also shrinks the size and energy cost of the MAC units. Compared with floating-point arithmetic, an integer adder is far simpler in hardware, which further reduces power consumption. The reason is that adding or subtracting floating-point numbers is not simply a matter of adding mantissas: the exponents must be taken into account. The operands are aligned by shifting the mantissa of the number with the smaller absolute value to match the larger one, and only then are the mantissas added or subtracted. The hardware must therefore first compare magnitudes, select the exponent and sign, shift a mantissa, and finally add.
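The MAC loop of step 1, with INT8 operands and a wide accumulator as motivated in step 3, can be mimicked in a few lines of NumPy; the sizes and values below are arbitrary illustrations, not any particular accelerator's layout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical INT8 weights and inputs for one output neuron (M = 64 inputs).
W_int8 = rng.integers(-128, 128, size=64, dtype=np.int8)
x_int8 = rng.integers(-128, 128, size=64, dtype=np.int8)
b_int32 = np.int32(1000)  # bias pre-loaded into the accumulator

# MAC loop: each INT8 x INT8 product fits in 16 bits, but the running sum
# is kept in a wide accumulator so repeated additions cannot overflow.
acc = int(b_int32)
for w, x in zip(W_int8, x_int8):
    acc += int(w) * int(x)

# The same result expressed as a widening dot product.
ref = int(b_int32) + int(W_int8.astype(np.int32) @ x_int8.astype(np.int32))
```

With 64 products of at most $2^{14}$ in magnitude each, a 32-bit accumulator has ample headroom, which is exactly why hardware keeps the accumulator wider than the operands.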

To move from floating-point to efficient fixed-point operations, a scheme is needed for converting floating-point vectors to integers, namely quantization. To prevent overflow, the accumulator remains floating-point, so quantization of the bias value $b_n$ is generally not considered. The figure above shows how the neural network accelerator changes once quantization is introduced. The quantized computation can be expressed as:

$$\hat{A}_n = \sum_m \hat{W}_{n,m}\,\hat{\mathbf{x}}_m = \sum_m \left(s_w W_{n,m}^{int}\right)\left(s_x \mathbf{x}_m^{int}\right)$$

Here $s_w$ and $s_x$ are the floating-point conversion factors of the weights and inputs, respectively; the example uses INT8 quantization for both weights and activations. The values held in the 32-bit accumulator must be written to memory before the next layer can use them. To reduce the cost of data transfer and of the next layer’s computation, these values are requantized back to INT8. Quantization computes and stores tensors at bit-widths lower than floating-point precision; a quantized model performs some or all of its tensor operations at this reduced precision rather than in full (floating-point) precision, which makes the model representation more compact and enables high-performance vectorized operations on many hardware platforms. In neural network quantization, weights and activations are stored at low bit precision instead of the FP16 or FP32 used for training. Industry most often chooses INT8; going from FP32 to INT8 brings improvements on the following points:

  • Model storage is reduced by 4 times;
  • Memory bandwidth is reduced by 2-4 times;
  • Inference is 2-4 times faster due to memory bandwidth savings and faster computation of the INT8 algorithm (exact acceleration varies by hardware, runtime, and model).
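Putting the pieces together, a minimal sketch of the quantized layer computation $\hat{A}_n = \sum_m (s_w W_{n,m}^{int})(s_x \mathbf{x}_m^{int})$ with INT32 accumulation and requantization to INT8 might look like this; the `sym_quant` helper and the tensor shapes are illustrative assumptions, not a specific accelerator's scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# A float "reference" layer y = Wx + b (shapes and data are illustrative).
W = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)
b = rng.standard_normal(8).astype(np.float32)

def sym_quant(t, bits=8):
    """Symmetric per-tensor quantization: t is approximated by scale * t_int."""
    scale = float(np.abs(t).max()) / (2 ** (bits - 1) - 1)
    t_int = np.clip(np.round(t / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return t_int.astype(np.int32), scale

W_int, s_w = sym_quant(W)   # INT8 values held in int32 containers for the matmul
x_int, s_x = sym_quant(x)

# Integer products accumulated in INT32, then dequantized with s_w * s_x.
acc = W_int @ x_int
y_hat = s_w * s_x * acc.astype(np.float32) + b

# Requantize the activation to INT8 so the next layer can consume it.
y_int, s_y = sym_quant(y_hat)
y_next = s_y * y_int.astype(np.float32)
```

The unquantized bias is added after dequantization here, matching the remark above that the bias generally stays at accumulator precision.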

3. Basic concepts of quantization

3.1 Floating-point quantization

Since quantization bridges fixed-point and floating-point, it helps to understand their basics before turning to the related research and solutions. Both fixed-point and floating-point are numerical representations; they differ in how the point separating the integer part from the fractional part is placed. Fixed-point reserves a specific number of digits for the integer and fractional parts, while floating-point stores a specific number of significand and exponent digits.

The figure above gives the formats and examples of fixed-point and floating-point representations. For fixed-point, I denotes an integer digit and F a fractional digit, as in IIIII.FFFFF. For floating-point, base 2, 10, or 16 corresponds to binary, decimal, or hexadecimal. Among the built-in data types of an instruction set architecture, fixed-point numbers are the integers and floating-point numbers are the binary floating types. At the instruction-set level, fixed-point is effectively continuous: it is an integer format, and the gap between two adjacent representable numbers is 1. Floating-point, by contrast, represents real numbers whose gaps are determined by the exponent, giving it a very wide range of values. An FP32 single-precision value occupies 4 bytes: a sign bit, an 8-bit exponent, and a 23-bit mantissa. A 32-bit unsigned integer can represent values up to $2^{32}-1$, while the FP32 range is approximately $[-(2-2^{-23})\times 2^{127},\ (2-2^{-23})\times 2^{127}]$, with values closer to zero represented more precisely. For each exponent value, floating-point provides the same number of representable values over intervals of very different widths. Converting a network from FP32 to INT8 is therefore not a simple type-cast truncation.

Fortunately, the weights of a neural network are distributed over a very limited range and cluster close to zero. The figure above shows the weight distributions of ten layers in MobileNetV1 (the layers with the most values); the values fall in $(-1, 1)$. Quantizing floating-point then means mapping FP32 to INT8 with a rule such as $x_{float} = x_{scale} \times x_{quantized}$, where $x_{float}$ is the FP32 value, $x_{quantized}$ is the quantized INT8 weight, and $x_{scale}$ is the quantization scale factor.

3.2 Uniform and non-uniform quantization

Quantization is the process of approximating a continuous signal with a set of discrete symbols or integer values. The simplest and most direct method maps continuous values linearly to the nearest integers through an affine transformation. After scaling, rounding, offsetting, and overflow protection, floating-point numbers are quantized into a discrete space; this is generally called linear quantization or uniform quantization, and because it is easy to implement, it is the most widely used. When quantizing uniformly to $n$ bits, an unsigned integer covers the range $\{0, \ldots, 2^n - 1\}$, while a signed integer covers $\{-2^{n-1}, \ldots, 2^{n-1} - 1\}$, as shown on the left of the figure below (the yellow dots are the quantized fixed-point values).
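As a small numeric check, the scale-only mapping $x_{float} = x_{scale} \times x_{quantized}$ keeps near-zero weights accurate to within half a quantization step; the weights below are synthetic, loosely mimicking a MobileNetV1-style histogram:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic weights concentrated near zero inside (-1, 1)
# (illustrative data, not real MobileNetV1 weights).
w_float = np.clip(rng.normal(0.0, 0.2, size=1000), -0.99, 0.99).astype(np.float32)

# Scale-only INT8 mapping: x_float ~ x_scale * x_quantized, values in [-127, 127].
x_scale = float(np.abs(w_float).max()) / 127.0
x_quantized = np.round(w_float / x_scale).astype(np.int8)
w_restored = x_scale * x_quantized.astype(np.float32)

# Rounding error is bounded by half a quantization step.
max_err = float(np.abs(w_float - w_restored).max())
```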

The operation of uniform quantization is as follows:

$$x_{float} = x_{scale} \times (x_{quantized} - x_{zero\_point})$$

In most cases, an unsigned integer is used for quantization, so the INT8 range is [0, 255]. The zero offset $x_{zero\_point}$ is meaningful in this case. Concretely, quantizing a floating-point value can be divided into two steps, given by the equations below:

$$x_{float} \in [x_{float}^{min},\ x_{float}^{max}]$$

$$x_{scale} = \frac{x_{float}^{max} - x_{float}^{min}}{x_{quantized}^{max} - x_{quantized}^{min}}$$

$$x_{zero\_point} = x_{quantized}^{max} - x_{float}^{max} / x_{scale}$$

$$x_{quantized} = x_{float} / x_{scale} + x_{zero\_point}$$

  1. Determine $x_{scale}$ and $x_{zero\_point}$ by finding the minimum and maximum values of the (FP32) weight tensor.

  2. Convert each value of the weight tensor from FP32 to INT8. **Note**: when the result of the floating-point computation is not an integer, an additional rounding step is required. For example, mapping the FP32 range [−1, 1] to the INT8 range [0, 255] gives $x_{scale} = 2/255$ and $x_{zero\_point} = 255 - 255/2 = 127.5 \approx 127$. Uniform quantization can be expressed by the following formulas:

    $$x_{int} = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x_f}{scale} + zero\_point\right),\ -2^{n-1},\ 2^{n-1}-1\right)$$

    $$\mathrm{clamp}(x, min, max) = \begin{cases} min, & x \leq min \\ x, & min \leq x \leq max \\ max, & x \geq max \end{cases}$$

    where:

  • $x_f$ is the original floating-point number, and $x_{int}$ is the quantized fixed-point value;
  • $scale$ is the quantization coefficient;
  • $zero\_point$ is the offset, i.e. the quantized fixed-point value corresponding to the floating-point value 0;
  • $round$ is the rounding function;
  • $n$ is the quantization bit width; for INT8 quantization, $n = 8$;
  • $clamp$ is the clamping function, which limits a value to the given range.
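The formulas above can be implemented directly. The sketch below reproduces the [−1, 1] → [0, 255] example from the text; the helper names are ours, and note that the fractional zero point 127.5 rounds to 128 under round-half-to-even:

```python
import numpy as np

def quantize_uniform(x, f_min, f_max, bits=8):
    """Asymmetric uniform quantization onto unsigned n-bit integers,
    following scale = (f_max - f_min) / (q_max - q_min) and
    zero_point = q_max - f_max / scale (rounded to an integer)."""
    q_min, q_max = 0, 2 ** bits - 1
    scale = (f_max - f_min) / (q_max - q_min)
    zero_point = int(round(q_max - f_max / scale))
    x_q = np.clip(np.round(x / scale + zero_point), q_min, q_max)
    return x_q.astype(np.uint8), scale, zero_point

def dequantize_uniform(x_q, scale, zero_point):
    """Inverse mapping: x_float = scale * (x_quantized - zero_point)."""
    return scale * (x_q.astype(np.float32) - zero_point)

# The example from the text: map the FP32 range [-1, 1] onto INT8 [0, 255].
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
x_q, scale, zp = quantize_uniform(x, -1.0, 1.0)
x_hat = dequantize_uniform(x_q, scale, zp)
```

Round-tripping any value in the calibrated range recovers it to within half a quantization step, i.e. $scale/2 \approx 0.004$ here.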

Non-uniform quantization uses nonlinear functions, such as logarithmic distributions or k-means clustering, to determine the correspondence between the original floating-point numbers and the fixed-point values. Compared with uniform quantization, it can adapt the mapping to the characteristics of the original data distribution and so better preserves accuracy, but it is difficult to deploy on hardware.

3.3 Symmetric quantization and asymmetric quantization

For uniform quantization, the core question studied by different quantization algorithms is how to determine the quantization scale coefficient $scale$ and the offset $zero\_point$, so as to preserve the model’s compression benefits while introducing as little quantization error as possible. $scale$ is computed as follows:

$$scale = \frac{\beta - \alpha}{2^n - 1}$$

[α, β] is the truncation range of the original floating-point values, and n is the quantization bit width; $scale$ determines the width of the quantization bins (there are $2^n - 1$ intervals in total). Determining the truncation range is also called calibration. The simplest approach takes the minimum and maximum of the original values directly, i.e. $\alpha = x_{min}$, $\beta = x_{max}$. This is usually an asymmetric quantization, since in general $\alpha \neq -\beta$; the figure below illustrates asymmetric quantization:

Since asymmetric quantization must add the offset $zero\_point$ (0 often has a special meaning, such as padding), it increases computational complexity. A common alternative is therefore to take $-\alpha = \beta = \max(|x_{min}|, |x_{max}|)$, which gives symmetric quantization:

$$x_{int} = \mathrm{round}\left(\frac{x_r}{scale}\right)$$
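A quick numerical comparison of symmetric and asymmetric quantization on non-negative, ReLU-style data (all values below are illustrative): the symmetric grid spends half of its levels on negative values that never occur, so its step size, and hence its error, is roughly twice as large.

```python
import numpy as np

rng = np.random.default_rng(3)

# Post-ReLU activations are non-negative (illustrative data).
act = np.maximum(rng.standard_normal(10000), 0.0).astype(np.float32)

# Symmetric: -alpha = beta = max(|x_min|, |x_max|), no zero_point needed,
# but the negative half of the INT8 grid is never used on this data.
beta = float(max(abs(act.min()), abs(act.max())))
s_sym = beta / 127.0
deq_sym = s_sym * np.clip(np.round(act / s_sym), -128, 127)
err_sym = float(np.abs(act - deq_sym).mean())

# Asymmetric: [alpha, beta] = [x_min, x_max] uses the whole [0, 255] grid.
alpha = float(act.min())
s_asym = (float(act.max()) - alpha) / 255.0
zp = int(round(-alpha / s_asym))
q = np.clip(np.round(act / s_asym + zp), 0, 255)
err_asym = float(np.abs(act - s_asym * (q - zp)).mean())
```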

Another variant lets the quantized interval [a, b] also be off-center from zero while satisfying $\frac{a}{b} = \frac{\alpha}{\beta}$; this is still a symmetric quantization and needs no $zero\_point$ offset. However, when the distribution of weights or activations is uneven (for example, activations after ReLU are non-negative), asymmetric quantization yields a more accurate truncation range and prevents useful information from being compressed or even drowned out. The direct min/max method is easily disturbed by outliers in the activations, which enlarge the truncation range and hurt quantization accuracy. To address this, a percentile is usually used instead of min/max, i.e. the i-th largest/smallest value in the feature map replaces the absolute maximum/minimum; alternatively, β and α are chosen to minimize the KL divergence between the floating-point and quantized value distributions.

3.4 PTQ and QAT

Whichever method is used, quantization approximates the data and loses some high-frequency information, so it inevitably introduces quantization error. Thanks to the robustness and fault tolerance of neural networks, a small amount of quantization error does not affect network performance, and post-training quantization (PTQ) alone suffices for deployment. If the quantization error exceeds what the network can tolerate, however, its performance degrades; the network parameters must then be retrained to adapt to the change in data distribution introduced by the quantization operations. This training process is generally called quantization-aware training (QAT). The Horizon chip algorithm toolchain supports both the PTQ and QAT quantization methods.
In addition, Horizon supports folding the quantization/dequantization nodes at the model’s input and output into the pre- and post-processing stages, reducing the time spent repeatedly traversing the data.

3.4.1 PTQ

PTQ calibrates a trained model with a batch of calibration data and converts the trained FP32 model directly into a fixed-point model, without any retraining of the original model. The quantization process needs only a few hyperparameter adjustments and is simple and fast, which is why the method is widely used in many edge-side and cloud-side deployment scenarios. However, because PTQ aims for speed, most methods focus on optimizing the local error of individual layers rather than the loss of the overall network task.

3.4.2 QAT

QAT quantizes a trained model and then retrains it. A copy of the full-precision weights is kept throughout training; during the forward pass the weights are quantized to the target integer precision, and the quantized weights are used to compute the network output and the loss function. When the weights are updated, small gradient changes accumulate on the full-precision copy, so that after several updates a weight may cross over to a neighboring quantization level. In actual training, pseudo-quantization is generally used: all operations are still computed with full-precision numbers, but weights and activations are quantized to discrete values and then dequantized back to the original range. Because QAT requires retraining the model, it places higher technical demands on the practitioner.
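The QAT loop can be sketched on a toy least-squares problem; the `fake_quant` helper and the straight-through update below are the generic textbook scheme, not any particular framework's API:

```python
import numpy as np

rng = np.random.default_rng(4)

def fake_quant(w, bits=8):
    """Pseudo-quantization: quantize to the integer grid, then dequantize
    straight back to float (the forward-pass trick used in QAT)."""
    scale = float(np.abs(w).max()) / (2 ** (bits - 1) - 1)
    return scale * np.clip(np.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)

# Toy linear regression trained QAT-style (problem and sizes are illustrative).
w_full = rng.standard_normal(8).astype(np.float32)   # full-precision master copy
X = rng.standard_normal((256, 8)).astype(np.float32)
y = X @ np.ones(8, dtype=np.float32)                 # target weights: all ones

lr = 0.1
for _ in range(200):
    w_q = fake_quant(w_full)        # forward pass uses the quantized weights
    grad = X.T @ (X @ w_q - y) / len(X)
    # Straight-through estimator: the gradient computed at w_q is applied
    # directly to the full-precision copy.
    w_full -= lr * grad
```

Gradient steps accumulate on `w_full` even when each step is smaller than a quantization level, which is exactly why the full-precision copy is kept.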

4. Horizon PTQ quantization

From the perspective of optimizing on-chip deployment, the Horizon chip algorithm toolchain quantizes symmetrically by default; asymmetric quantization is only explored in a few cases by the default calibration mode. PTQ calibrates the trained model with a batch of calibration data to obtain the quantization thresholds [α, β], then quantizes using the scale factor computed from those thresholds. For a quantization bit width of n, $scale$ is computed as:

$$scale = \frac{\beta - \alpha}{2^n - 1}$$

Common methods for selecting the quantization threshold are:

**1. Weight quantization**

For weight quantization thresholds, the default in most cases is the max method, with one threshold T per convolution kernel, computed as:

$$T = \max(\mathrm{abs}(max),\ \mathrm{abs}(min))$$

**2. Featuremap quantization**

For featuremap quantization, Horizon’s PTQ conversion tool hb_mapper makertbin provides several calibration methods, selected via the calibration_type parameter in the yaml file.

  • max calibration method: the larger of the absolute values of the featuremap maximum (vmax) and minimum (vmin) is selected as the threshold:

    $$T = \max(\mathrm{abs}(v_{max}),\ \mathrm{abs}(v_{min}))$$

    In addition, the max_percentile parameter in the yaml configuration can adjust the truncation point used by max calibration; common settings are 0.99999/0.99995/0.99900/0.99990/0.99950.

  • kl divergence calibration method: KL divergence measures the distance between the distribution P of the original float data and the distribution Q of the quantized int data. The method traverses the possible truncation ranges, quantizes according to each, compares the quantized int distribution with the float distribution before quantization by computing the KL divergence, and selects the truncation range with the smallest divergence as the threshold used in the final quantization. The KL divergence is computed as:

    $$KL(P, Q) = \sum_{i=1}^{n} P_i\,(\log_2 P_i - \log_2 Q_i)$$

  • mix calibration method: mix is a search strategy that integrates multiple calibration methods. It automatically identifies quantization-sensitive nodes, selects the best method per node from the different calibration methods, and finally builds a combined calibration that merges the advantages of each. The detailed process is:

    **Step 1:** Use the kl calibration method to compute the quantization sensitivity of each node in the current model (measured by cosine similarity); nodes whose value falls below a specific threshold are defined as quantization-sensitive nodes (the nodes with the greatest impact on the model’s quantized accuracy).
    **Step 2:** Traverse all quantization-sensitive nodes, try the max, max-percentile 0.99995, and kl calibration methods on each, select the best calibration method for each node, and obtain the mix calibration model.
    **Step 3:** Evaluate the cumulative error of the mix, max, max-percentile 0.99995, and kl calibration models, and output the best model.

  • default calibration method: default is an automatic search strategy that tries to obtain a relatively good combination of calibration quantization parameters. The detailed process is:

    **Step 1:** Try the max, max-percentile 0.99995, and kl calibration methods and compute the cosine similarity for each. If the highest cosine similarity among the three is below 0.995, go to Step 2; otherwise, return the threshold combination corresponding to the highest similarity.
    **Step 2:** Try max-percentile 0.99995 combined with per-channel quantization. If the highest cosine similarity among the four methods is below 0.995, go to Step 3; otherwise, return the threshold combination corresponding to the highest similarity.
    **Step 3:** Take the method with the highest cosine similarity from Step 2 and apply asymmetric quantization as a fifth candidate; select the best of the five schemes by cosine similarity and return the corresponding threshold combination.
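The three basic threshold choices (max, percentile, and a KL-divergence search) can be compared on synthetic data with outliers; the histogram sizes and the candidate set below are illustrative assumptions, not hb_mapper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Activations with a few large outliers, as real feature maps often have
# (illustrative data).
act = np.concatenate([rng.normal(0.0, 1.0, 100000), [40.0, -35.0]]).astype(np.float32)
abs_act = np.abs(act)

# max calibration: T = max(abs(vmax), abs(vmin)) -- sensitive to outliers.
t_max = float(abs_act.max())

# percentile calibration: truncate at the 0.99995 quantile of the magnitudes.
t_pct = float(np.quantile(abs_act, 0.99995))

# kl calibration: among candidate truncation points, keep the one whose
# quantized distribution is closest to the float distribution.
def kl_threshold(x_abs, candidates, levels=128, bins=512):
    edges = np.linspace(0.0, x_abs.max(), bins + 1)
    p, _ = np.histogram(x_abs, bins=edges)
    p = p / p.sum()
    best_t, best_kl = None, np.inf
    for t in candidates:
        scale = t / (levels - 1)
        # Quantize-dequantize the data with truncation threshold t.
        x_q = scale * np.clip(np.round(x_abs / scale), 0, levels - 1)
        q, _ = np.histogram(x_q, bins=edges)
        q = q / q.sum()
        mask = p > 0
        kl = float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-10))))
        if kl < best_kl:
            best_kl, best_t = kl, float(t)
    return best_t

t_kl = kl_threshold(abs_act, np.quantile(abs_act, [0.999, 0.9999, 0.99999, 1.0]))
```

On this data the max threshold is inflated by the two outliers, while the percentile and KL thresholds both clip them and keep a finer quantization step for the bulk of the distribution.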

References

  1. Nagel M, Fournarakis M, Amjad R A, et al. A white paper on neural network quantization[J]. arXiv preprint arXiv:2106.08295, 2021.
  2. Li Bowen. Research on deep neural network quantization and its hardware acceleration [D]. Zhejiang University, 2022. DOI: 10.27461/d.cnki.gzjdx.2022.000973.