Foreword
Padding is an important step when writing C++ code for on-board deployment: it makes the input data conform to the BPU's stride alignment rules, so that BPU computation is more efficient. Padding image data is different from padding featuremap data. Image data is the simple case: the padding can be performed automatically by the on-board model inference library.
Image data
Common image data formats include Y/NV12/NV12_SEPARATE/YUV444/RGB/BGR. This chapter walks through the 00_quick_start example in the OE package directory ddk/samples/ai_toolchain/horizon_runtime_sample/code, which will be covered in detail in the upcoming article "Getting Started with Model Inference". In the 00_quick_start example, the prepare_tensor function parses the model's input-node information and allocates alignedByteSize bytes of BPU memory. The read_image_2_tensor_as_nv12 function then reads the input image and copies the image data into this BPU memory. At this point, the input data in BPU memory is laid out as follows:
In Figure 1, alignedByteSize is the BPU memory space allocated by the prepare_tensor function, into which the image data is copied with memcpy. At the end of the memory there is some space that holds no data; this space is reserved for the on-board inference library to perform padding. The stride alignment rules of the BPU are as follows:
Since memory stores data in one dimension, we can flatten this two-dimensional diagram into one dimension, which looks like this:
So how do we get from Figure 1 to Figure 3? The prepare_tensor function contains this line of code:
input[i].properties.alignedShape = input[i].properties.validShape;
During on-board inference, when the model inference library detects that alignedShape equals validShape, it concludes that the user has not padded the input data, and it automatically pads according to the padding rules of the corresponding data type, converting Figure 1 into Figure 3. Representing Figure 3 in two dimensions brings us back to the familiar alignment of Figure 2. Conversely, if the input data has already been padded according to the BPU alignment rules, or is naturally aligned to them, then the input image size already equals alignedByteSize, no space remains after the memcpy, and Figure 1 directly matches the layouts of Figures 2 and 3. **Summary: When image data is the model input, the user does not need to pad it manually. Simply add the line of code above after allocating the BPU memory, and the on-board model inference library will complete the padding automatically.**
featuremap data
When featuremap data is the input, the on-board inference library cannot pad automatically; you need to write code that pads according to the BPU alignment rules. Starting with version 1.1.49b of the J5 algorithm toolchain, and versions 2.5.2 (gcc-9.3.0) and 1.16.2c (gcc-6.5.0) of the XJ3 algorithm toolchain, the horizon_runtime_sample example includes padding code for featuremap data. Developers can open the run_resnet_feature.cc code in the OE package directory ddk/samples/ai_toolchain/horizon_runtime_sample/code/03_misc/resnet_feature/src to study it.
First, it's important to note that not every featuremap input requires manual padding. If the first operator the featuremap feeds into is a CPU operator (such as quantization), padding is not required, because the stride alignment rules apply only to the BPU, not the CPU. Furthermore, even if the first operator is a BPU operator, padding is not required if the input already satisfies the stride alignment requirement. A featuremap input needs manual padding only when both of the following conditions hold:
- The first operator the featuremap feeds into is a BPU operator.
- The featuremap does not satisfy the stride alignment requirement, i.e. the last dimension of the data layout is not 16-byte aligned.

Next, let's look at how the horizon_runtime_sample example aligns the featuremap input. The model used in the example is a single-input featuremap model whose original quantization operator at the input node has been removed, so the first operator is directly a BPU operator. The model's input requires the int8 data type, NCHW data layout, and dimensions of 1x64x56x56.
Because the first operator on the input side is a BPU operator and the last dimension is 56 bytes, which does not meet the 16-byte alignment requirement, both conditions above hold and manual padding is required. Before inference starts, we need to pad the W dimension of the featuremap to 64 bytes. As with the image-data flow, the padding is done in the prepare_feature_tensor function: after allocating the aligned memory space for the input tensor, it calls the tensor_padding_feature function to do the padding. Let's focus on interpreting that function; some of its key code is shown below:
float *feature_data = reinterpret_cast<float *>(data);
float scale = tensor_property.scale.scaleData[0];
int *stride = tensor_property.stride;
int8_t *tensor_data = reinterpret_cast<int8_t *>(tensor->sysMem[0].virAddr);
// do quantize and padding
if (tensor_property.tensorLayout == HB_DNN_LAYOUT_NCHW) {
  // for j5 NCHW
  for (int n = 0; n < batch; n++) {
    for (int c = 0; c < channel; c++) {
      for (int h = 0; h < height; h++) {
        auto *raw =
            tensor_data + n * stride[0] + c * stride[1] + h * stride[2];
        for (int w = 0; w < width; w++) {
          *raw++ = int_quantize(*feature_data++, scale, 0, -128.f, 127.f);
        }
      }
    }
  }
}
The tensor_padding_feature function distinguishes NCHW from NHWC according to the data layout: NCHW must meet 16-byte alignment in the W dimension, while NHWC must meet it in the C dimension. The alignment operations for the two layouts are similar, and since the example model is NCHW, we will analyze the NCHW branch.

The code sets up four nested for loops over N, C, H and W in turn. Because the model's quantization node has been removed, the int_quantize function is called in the innermost W loop to make up for the quantization computation. The padding itself is implemented in the H loop. stride means the number of bytes that must be crossed in memory each time the index of the corresponding dimension increases by 1. In this example, stride[0] for N is 64x56x64 = 229376, stride[1] for C is 56x64 = 3584, stride[2] for H is 64, and stride[3] for W is 1 (the W stride is not used in the loop). Since the strides are computed from the aligned data, each time the W loop completes its 56 valid quantizations and assignments to *raw, incrementing h advances the row pointer 64 bytes from its previous base. This is equivalent to having the W dimension skip 64-56 = 8 bytes before continuing to quantize and assign, thereby aligning the W dimension to 64 bytes and meeting the 16-byte alignment requirement. The 8 skipped bytes are the padding and receive no quantization computation. After all alignment operations are performed, the data in BPU memory is laid out as follows:
At this point, the user-written steps are complete. The padding space in the input data is left empty, and no zeros need to be written to it by the user: the on-board inference library fills it with zeros itself during model inference. The stride used in the tensor_padding_feature function comes from the hbDNNTensorProperties structure; it should not be confused with the concept of alignment. stride is the number of bytes the memory address must advance each time the index of a dimension increases by 1, whereas alignment is the byte boundary to which a dimension must be aligned/padded. **Summary: When featuremap data is the model input, the user must write code to do the padding actively. After allocating the BPU memory, write nested for loops according to the data layout type; in the second loop from the inside out, the space requiring padding is skipped. If the model's input-side quantization operator has been removed, the quantization computation can be made up in the innermost loop.**