Data Arrangement and Span Alignment

1 Data layout

1.1 Concept of data arrangement

In deep learning frameworks, a feature map is usually represented as a four-dimensional array whose dimensions are: batch size N, number of feature map channels C, feature map height H, and feature map width W. Data layout refers to the arrangement of these four dimensions, usually either NHWC or NCHW. Although from a human perspective both NHWC and NCHW are four-dimensional data, computer memory is linear, so the four-dimensional array must be stored in one-dimensional form; the difference between NHWC and NCHW lies in how the four dimensions are ordered in memory. Note that the concepts of NHWC and NCHW do not apply to the NV12 (YUV420) data type: every four Y components share one set of UV components, so there is no notion of a channel dimension.

1.2 NHWC

For a 2x2 RGB image with the NHWC arrangement, the dimensions are stored in memory in the order C, W, H, N (fastest- to slowest-varying), so pixels at the same position in different channels are stored together, as shown in the following figure:

1.3 NCHW

If the same 2x2 RGB image uses the NCHW arrangement, it is stored in memory in the order W, H, C, N; that is, all of R is stored first, then all of G, and finally all of B, as shown in the following figure:

1.4 Support

The PyTorch, Caffe, and PaddlePaddle deep learning frameworks use the NCHW format. TensorFlow uses NHWC by default, but its GPU version also supports NCHW. For the Horizon algorithm toolchain, models trained with either the NCHW or the NHWC data arrangement can be converted and compiled normally.

2 Span alignment

2.1 Concept of span

Stride refers to the actual amount of memory occupied by each row of an image when it is stored. Most processors are 32-bit or 64-bit, so each complete read is most efficient when it covers a multiple of 4 or 8 bytes; other sizes require special handling and reduce efficiency. To let the computer process images efficiently, extra data is often appended to each row of the original data to reach 4-byte or 8-byte alignment. This extra data is called padding, and the actual alignment rules depend on the software and hardware.

Suppose we have an 8-bit grayscale image with a height of 20 pixels and a width of 30 pixels; the valid data of each row is then 30 bytes. If the computer's alignment rule is 8 bytes, the stride of the aligned image is 32 bytes, and each row needs 2 bytes of padding.

2.2 BPU span alignment

The above is only a general introduction to stride rules; the BPU in Horizon's Journey and Rising Sun series chips has its own span alignment rules. For example, for NV12 input with even H and W, the width should be aligned to a multiple of 16 bytes. The BPU's span alignment rules differ across data configurations and data types. The alignment of image data is performed automatically on the board side by the Model Inference predictor library (e.g., by setting alignedShape = input[i].properties.validShape;); when writing deployment code, you only need to allocate BPU memory according to the aligned byte size. For details on how to align featuremap data, see horizon_runtime_sample/code/03_misc/resnet_feature in the OE package. The aligned byte size can be obtained directly by reading the model parameters, which makes it very convenient to use.

typedef struct {
  hbDNNTensorShape validShape;   // valid (unpadded) shape of the data
  hbDNNTensorShape alignedShape; // shape after alignment/padding
  int32_t tensorLayout;          // data layout, e.g. NCHW or NHWC
  int32_t tensorType;            // element data type
  hbDNNQuantiShift shift;        // quantization shift parameters
  hbDNNQuantiScale scale;        // quantization scale parameters
  hbDNNQuantiType quantiType;    // quantization type
  int32_t quantizeAxis;          // axis along which quantization applies
  int32_t alignedByteSize;       // total byte size after alignment
  int32_t stride[HB_DNN_TENSOR_MAX_DIMENSIONS]; // per-dimension stride
} hbDNNTensorProperties;

In the toolchain's C++ SDK, hbDNNTensorProperties contains detailed information about the model's input/output tensors: validShape is the valid size of the data, alignedShape is the aligned size, and alignedByteSize is the byte size after alignment. Using this data properly makes deployment code more efficient; more information can be found in the BPU SDK API section of the toolchain manual.

2.3 Remove alignment

Alignment exists to serve the image-reading performance of the hardware and software; after the computation task completes, the alignment needs to be removed so that only the valid data is retained. If the model ends in a BPU node, you must write code to skip the padding data manually (use hrt_model_exec model_info to view the alignedShape and validShape of the model's inputs and outputs). If the tail of the model contains CPU nodes, the alignment is removed automatically during data transmission between the BPU and the CPU, without manual user intervention.