Get Started with Model Inference Quickly

1 Preface

After a model has been transformed and compiled, the resulting model file can be deployed on the development board. The file carries either a bin or an hbm suffix. The difference is that bin (Binary) models are heterogeneous hybrid models that may contain both CPU operators and BPU operators, while hbm (Horizon BPU Model) files contain only BPU operators. For on-board deployment the two formats behave identically: both use the same BPU SDK API inference library, and both are supported by the hrt_bin_dump and hrt_model_exec analysis tools.

2 Example introduction

The ddk/samples/ai_toolchain/horizon_runtime_sample directory of the OE package contains many basic on-board deployment examples. The code directory holds the source code and build-related files; the script directory holds the executables produced by the build scripts, along with preset data and the corresponding models. Each model inference example can be executed by running its script from the script directory on the development board. The quick start example covered in this article is 00_quick_start: it runs the mobilenetv1_224x224_nv12.bin classification model, reads a jpg image, performs one forward inference, and computes the Top5 classification results in post-processing.

Before writing on-board deployment code, developers should become familiar with the on-board deployment API provided by Horizon; see the BPU SDK API chapter of the Toolchain manual. That chapter not only documents each API interface in detail, but also covers the data types and data structures used in on-board deployment, the data layout and alignment rules, error codes, and more. Reading the sample code side by side with the API manual is an effective way to learn.

3 Program structure

This diagram shows the six main steps of the main function in the code. The solid arrows represent the steps of the main function and the dashed arrows represent the custom functions that need to be called at that step. The specific implementation of the functions is outside the main function.

4 Code interpretation

The path to the quick start code in the OE development package is: ddk/samples/ai_toolchain/horizon_runtime_sample/code/00_quick_start/src/run_mobileNetV1_224x224.cc

DEFINE_string(model_file, EMPTY, "model file path");
DEFINE_string(image_file, EMPTY, "Test image path");
DEFINE_int32(top_k, 5, "Top k classes, 5 by default");

The flags defined here are the command line parameters of the run_mobilenetV1.sh script, i.e. the input arguments to be parsed: the model file path, the test image path, and the TopK setting for the classification results.

#define HB_CHECK_SUCCESS(value, errmsg)                              \
  do {                                                               \
    /*value can be call of function*/                                \
    auto ret_code = value;                                           \
    if (ret_code != 0) {                                             \
      VLOG(EXAMPLE_SYSTEM) << errmsg << ", error code:" << ret_code; \
      return ret_code;                                               \
    }                                                                \
  } while (0);

HB_CHECK_SUCCESS is used to check whether a function call succeeded. Pass the call to be executed in the value field; if the call fails, its error code is returned and printed to the terminal. You can also use the hbDNNGetErrorDesc interface to print a description of the error.
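The same check-and-return pattern can be sketched in standalone form (CHECK_SUCCESS, fake_api and run_pipeline are illustrative names for this sketch, not SDK symbols):

```cpp
#include <iostream>

// Minimal sketch of the check-and-return pattern: evaluate a call that
// returns 0 on success; on failure, log the message and propagate the
// error code to the caller.
#define CHECK_SUCCESS(value, errmsg)                              \
  do {                                                            \
    auto ret_code = (value);                                      \
    if (ret_code != 0) {                                          \
      std::cerr << errmsg << ", error code:" << ret_code << "\n"; \
      return ret_code;                                            \
    }                                                             \
  } while (0)

// Stand-in for any SDK call that returns an error code.
int fake_api(int rc) { return rc; }

int run_pipeline(int rc) {
  CHECK_SUCCESS(fake_api(rc), "fake_api failed");
  return 0;  // reached only when fake_api returned 0
}
```

Wrapping the macro body in do { ... } while (0) makes it behave like a single statement, so it can safely be used inside an unbraced if/else.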

typedef struct Classification {
  int id;
  float score;
  const char *class_name;

  Classification() : id(0), score(0.0), class_name(nullptr) {}
  Classification(int id, float score, const char *class_name)
      : id(id), score(score), class_name(class_name) {}

  friend bool operator>(const Classification &lhs, const Classification &rhs) {
    return (lhs.score > rhs.score);
  }

  ~Classification() {}
} Classification;

The Classification structure defines three members: the class index id, the classification score, and the class name class_name; all three are used in the TopK post-processing. The > operator, overloaded as a friend function, works together with the ordering of the priority queue used in the TopK calculation; this is explained in detail in the TopK code analysis below.
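As a quick standalone illustration of why the > overload matters (Cls and min_score_id are hypothetical names for this sketch): pairing operator> with std::greater turns std::priority_queue into a min-heap, so top() always yields the element with the lowest score.

```cpp
#include <functional>
#include <queue>
#include <vector>

struct Cls {
  int id;
  float score;
  friend bool operator>(const Cls &lhs, const Cls &rhs) {
    return lhs.score > rhs.score;
  }
};

// With std::greater<Cls> as the comparator, the priority queue is a
// min-heap: top() is the element with the LOWEST score.
int min_score_id(const std::vector<Cls> &items) {
  std::priority_queue<Cls, std::vector<Cls>, std::greater<Cls>> q;
  for (const auto &c : items) q.push(c);
  return q.top().id;
}
```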

Let’s skip the function declaration and go straight to the main function.

  // Parsing command line arguments
  gflags::SetUsageMessage(argv[0]);
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  std::cout << gflags::GetArgv() << std::endl;

  // Init logging
  google::InitGoogleLogging("");
  google::SetStderrLogging(0);
  google::SetVLOGLevel("*", 3);
  FLAGS_colorlogtostderr = true;
  FLAGS_minloglevel = google::INFO;
  FLAGS_logtostderr = true;

  hbPackedDNNHandle_t packed_dnn_handle;
  hbDNNHandle_t dnn_handle;
  const char **model_name_list;
  auto modelFileName = FLAGS_model_file.c_str();
  int model_count = 0;

This code uses the gflags interface to parse the file information read by the script and assign it to a specific variable.

  // Step1: get model handle
  {
    HB_CHECK_SUCCESS(
        hbDNNInitializeFromFiles(&packed_dnn_handle, &modelFileName, 1),
        "hbDNNInitializeFromFiles failed");
    HB_CHECK_SUCCESS(hbDNNGetModelNameList(
                         &model_name_list, &model_count, packed_dnn_handle),
                     "hbDNNGetModelNameList failed");
    HB_CHECK_SUCCESS(
        hbDNNGetModelHandle(&dnn_handle, packed_dnn_handle, model_name_list[0]),
        "hbDNNGetModelHandle failed");
  }

Step1 uses three APIs: initializing the model from a file, getting the model names and count, and getting the model handle. It involves the concept of a “pack”: the toolchain’s hb_pack tool can consolidate multiple converted bin models into a single file (see section 4.1.1.9 “Other Model Tools” of the Toolchain manual for details). If hbDNNInitializeFromFiles parses a single, unpacked model, packed_dnn_handle refers to that one model; if it parses a packed file, packed_dnn_handle refers to all of the contained models, model_name_list lists their names, and model_count is the total number of models.

  std::vector<hbDNNTensor> input_tensors;
  std::vector<hbDNNTensor> output_tensors;
  int input_count = 0;
  int output_count = 0;
  // Step2: prepare input and output tensor
  {
    HB_CHECK_SUCCESS(hbDNNGetInputCount(&input_count, dnn_handle),
                     "hbDNNGetInputCount failed");
    HB_CHECK_SUCCESS(hbDNNGetOutputCount(&output_count, dnn_handle),
                     "hbDNNGetOutputCount failed");
    input_tensors.resize(input_count);
    output_tensors.resize(output_count);
    prepare_tensor(input_tensors.data(), output_tensors.data(), dnn_handle);
  }

input_tensors and output_tensors are the input and output tensors, defined as vectors of hbDNNTensor, the tensor type encapsulated by Horizon. input_count and output_count hold the number of input and output nodes of the model. In Step2, the hbDNNGetInputCount and hbDNNGetOutputCount interfaces are called to obtain those node counts and initialize input_count and output_count. The input_tensors and output_tensors vectors are then resized to the corresponding lengths, and the prepare_tensor function allocates their memory. Let’s look at the prepare_tensor function (the input section is excerpted):

int prepare_tensor(hbDNNTensor *input_tensor,
                   hbDNNTensor *output_tensor,
                   hbDNNHandle_t dnn_handle) {
  int input_count = 0;
  int output_count = 0;
  hbDNNGetInputCount(&input_count, dnn_handle);
  hbDNNGetOutputCount(&output_count, dnn_handle);

  hbDNNTensor *input = input_tensor;
  for (int i = 0; i < input_count; i++) {
    HB_CHECK_SUCCESS(
        hbDNNGetInputTensorProperties(&input[i].properties, dnn_handle, i),
        "hbDNNGetInputTensorProperties failed");
    int input_memSize = input[i].properties.alignedByteSize;
    HB_CHECK_SUCCESS(hbSysAllocCachedMem(&input[i].sysMem[0], input_memSize),
                     "hbSysAllocCachedMem failed");
    input[i].properties.alignedShape = input[i].properties.validShape;
    const char *input_name;
    HB_CHECK_SUCCESS(hbDNNGetInputName(&input_name, dnn_handle, i),
                     "hbDNNGetInputName failed");
    VLOG(EXAMPLE_DEBUG) << "input[" << i << "] name is " << input_name;
  }
  //output
  ......
  return 0;
}

The main purpose of prepare_tensor is to allocate appropriate memory for the input and output tensors; if the model has multiple inputs and outputs, memory is allocated for each of them. hbDNNGetInputTensorProperties and hbDNNGetOutputTensorProperties parse the properties of the input and output tensors from the model; input_memSize and output_memSize are the aligned byte sizes of an input/output tensor. Note the statement input[i].properties.alignedShape = input[i].properties.validShape; — it lets the on-board inference library automatically align the input data for image input types (excluding the featuremap type). For how to pad input data during deployment, see the community article [padding input data in deployment](https://developer.horizon.cc/forumDetail/163806988210447365), and for the detailed alignment rules see [model input and output alignment rules](https://developer.horizon.cc/forumDetail/118364000835765837). A model’s alignedShape and validShape can be inspected on the board with the model_info function of the hrt_model_exec tool. After that, the hbSysAllocCachedMem interface allocates cached memory; with a cache, CPU read and write efficiency is much higher than without one. Finally, the name of each input/output node is printed after its memory is allocated.
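For a model whose alignedShape equals its validShape, as forced by the assignment above, the buffer size of an NV12 input follows directly from the format layout. A minimal sketch (nv12_byte_size is an illustrative helper, not an SDK function):

```cpp
#include <cstdint>

// NV12 stores a full-resolution Y plane followed by an interleaved UV
// plane at half horizontal and half vertical resolution, so the total
// size is h * w * 3 / 2 bytes (h and w must be even).
int32_t nv12_byte_size(int32_t h, int32_t w) {
  int32_t y_size = h * w;                   // one byte per Y sample
  int32_t uv_size = (h / 2) * (w / 2) * 2;  // interleaved U and V bytes
  return y_size + uv_size;
}
```

For the 224x224 model in this example this gives 224 * 224 * 3 / 2 = 75264 bytes, which should match the reported alignedByteSize when no extra padding is required.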

  // Step3: set input data to input tensor
  {
    HB_CHECK_SUCCESS(
        read_image_2_tensor_as_nv12(FLAGS_image_file, input_tensors.data()),
        "read_image_2_tensor_as_nv12 failed");
    VLOG(EXAMPLE_DEBUG) << "read image to tensor as nv12 success";
  }

Back in Step3 of the main function: now that memory for the input and output tensors has been allocated, the read_image_2_tensor_as_nv12 function stores the image data into the memory of the input tensor. This example reads a bgr image from a local file and converts it to nv12 as the model input. For handling images arriving via a video input path, refer to the full-pipeline example ddk/samples/ai_forward_view_sample in the OE package.

int32_t read_image_2_tensor_as_nv12(std::string &image_file,
                                    hbDNNTensor *input_tensor) {
  hbDNNTensor *input = input_tensor;
  hbDNNTensorProperties Properties = input->properties;
  int tensor_id = 0;
  // NCHW , the struct of mobilenetv1_224x224 shape is NCHW
  int input_h = Properties.validShape.dimensionSize[2];
  int input_w = Properties.validShape.dimensionSize[3];

  cv::Mat bgr_mat = cv::imread(image_file, cv::IMREAD_COLOR);
  if (bgr_mat.empty()) {
    VLOG(EXAMPLE_SYSTEM) << "image file not exist!";
    return -1;
  }
  // resize
  cv::Mat mat;
  mat.create(input_h, input_w, bgr_mat.type());
  cv::resize(bgr_mat, mat, mat.size(), 0, 0);
  // convert to YUV420
  if (input_h % 2 || input_w % 2) {
    VLOG(EXAMPLE_SYSTEM) << "input img height and width must aligned by 2!";
    return -1;
  }
  cv::Mat yuv_mat;
  cv::cvtColor(mat, yuv_mat, cv::COLOR_BGR2YUV_I420);
  uint8_t *nv12_data = yuv_mat.ptr<uint8_t>();

  // copy y data
  auto data = input->sysMem[0].virAddr;
  int32_t y_size = input_h * input_w;
  memcpy(reinterpret_cast<uint8_t *>(data), nv12_data, y_size);

  // copy uv data
  int32_t uv_height = input_h / 2;
  int32_t uv_width = input_w / 2;
  uint8_t *nv12 = reinterpret_cast<uint8_t *>(data) + y_size;
  uint8_t *u_data = nv12_data + y_size;
  uint8_t *v_data = u_data + uv_height * uv_width;

  for (int32_t i = 0; i < uv_width * uv_height; i++) {
    if (u_data && v_data) {
      *nv12++ = *u_data++;
      *nv12++ = *v_data++;
    }
  }
  return 0;
}

Among the parameters of the read_image_2_tensor_as_nv12 function, image_file is the input image and input_tensor is the input tensor. The image read by opencv’s imread interface is bgr data; it is first resized to the size required by the model, then converted to planar YUV420 (I420) with cvtColor, and finally rearranged into nv12 as it is copied into the memory of the input tensor.
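The copy logic above can be reproduced without OpenCV or the SDK as a standalone sketch (i420_to_nv12 is a hypothetical helper): I420 stores the Y plane, then the whole U plane, then the whole V plane, while NV12 keeps the Y plane and interleaves U and V.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Standalone version of the copy loop above: convert a planar I420
// buffer (Y plane, U plane, V plane) into semi-planar NV12 (Y plane,
// interleaved UV plane). h and w must both be even.
std::vector<uint8_t> i420_to_nv12(const std::vector<uint8_t> &i420,
                                  int32_t h, int32_t w) {
  const int32_t y_size = h * w;
  const int32_t uv_count = (h / 2) * (w / 2);  // samples per chroma plane
  std::vector<uint8_t> nv12(y_size + 2 * uv_count);

  // The Y plane is copied unchanged.
  std::memcpy(nv12.data(), i420.data(), y_size);

  // U and V samples are interleaved: U0 V0 U1 V1 ...
  const uint8_t *u = i420.data() + y_size;
  const uint8_t *v = u + uv_count;
  uint8_t *dst = nv12.data() + y_size;
  for (int32_t i = 0; i < uv_count; ++i) {
    *dst++ = *u++;
    *dst++ = *v++;
  }
  return nv12;
}
```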

  hbDNNTaskHandle_t task_handle = nullptr;
  hbDNNTensor *output = output_tensors.data();
  // Step4: run inference
  {
    // make sure memory data is flushed to DDR before inference
    for (int i = 0; i < input_count; i++) {
      hbSysFlushMem(&input_tensors[i].sysMem[0], HB_SYS_MEM_CACHE_CLEAN);
    }

    hbDNNInferCtrlParam infer_ctrl_param;
    HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
    HB_CHECK_SUCCESS(hbDNNInfer(&task_handle,
                                &output,
                                input_tensors.data(),
                                dnn_handle,
                                &infer_ctrl_param),
                     "hbDNNInfer failed");
    // wait task done
    HB_CHECK_SUCCESS(hbDNNWaitTaskDone(task_handle, 0),
                     "hbDNNWaitTaskDone failed");
  }

Next is Step4 of the main function. At this point we have loaded the model, allocated memory for the input and output tensors, and stored the image data to be inferred in the input tensor’s memory, so forward inference can begin. Before invoking the hbDNNInfer interface, however, some preparation is needed. First, a task handle of type hbDNNTaskHandle_t is declared and initialized to a null pointer, and the output pointer is set to the output tensors. Because the memory allocated for the input tensors is cached, the cached data must first be flushed to memory so that the BPU reads the correct data. After initializing the inference control parameters, forward inference is performed with the hbDNNInfer interface: task_handle is the task handle, output the output tensors, input_tensors the input tensors, and dnn_handle the model handle. infer_ctrl_param holds the control parameters of the inference task: it lets you choose which BPU core the task runs on and set the task’s priority. HB_DNN_INITIALIZE_INFER_CTRL_PARAM applies the default settings, which are as follows:

bpuCoreId = HB_BPU_CORE_ANY;
dspCoreId = HB_DSP_CORE_ANY;
priority = HB_DNN_PRIORITY_LOWEST;
customId = 0;
reserved1 = 0;
reserved2 = 0;

Finally, the hbDNNWaitTaskDone interface is used to wait for the inference task to complete.

  // Step5: do postprocess with output data
  std::vector<Classification> top_k_cls;
  {
    // make sure CPU read data from DDR before using output tensor data
    for (int i = 0; i < output_count; i++) {
      hbSysFlushMem(&output_tensors[i].sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    }

    get_topk_result(output, top_k_cls, FLAGS_top_k);
    for (int i = 0; i < FLAGS_top_k; i++) {
      VLOG(EXAMPLE_REPORT) << "TOP " << i << " result id: " << top_k_cls[i].id;
    }
  }

Step5 is the TopK calculation of the post-processing. First, a vector named top_k_cls is defined; its element type is the Classification structure defined earlier, containing the class index, classification score, and class name. Before performing the TopK calculation, the BPU output data must be synchronized from memory to the cache so that the CPU reads the correct values. Finally, the results are printed with VLOG.

void get_topk_result(hbDNNTensor *tensor,
                     std::vector<Classification> &top_k_cls,
                     int top_k) {
  hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_INVALIDATE);
  std::priority_queue<Classification,
                      std::vector<Classification>,
                      std::greater<Classification>>
      queue;
  int *shape = tensor->properties.validShape.dimensionSize;
  // The type reinterpret_cast should be determined according to the output type
  // For example: HB_DNN_TENSOR_TYPE_F32 is float
  auto data = reinterpret_cast<float *>(tensor->sysMem[0].virAddr);
  auto shift = tensor->properties.shift.shiftData;
  auto scale = tensor->properties.scale.scaleData;
  int tensor_len = shape[0] * shape[1] * shape[2] * shape[3];
  for (auto i = 0; i < tensor_len; i++) {
    float score = 0.0;
    if (tensor->properties.quantiType == SHIFT) {
      score = data[i] / (1 << shift[i]);
    } else if (tensor->properties.quantiType == SCALE) {
      score = data[i] * scale[i];
    } else {
      score = data[i];
    }
    queue.push(Classification(i, score, ""));
    if (queue.size() > top_k) {
      queue.pop();
    }
  }
  while (!queue.empty()) {
    top_k_cls.emplace_back(queue.top());
    queue.pop();
  }
  std::reverse(top_k_cls.begin(), top_k_cls.end());
}

Let’s take a closer look at the get_topk_result function:

  1. The function takes three parameters: tensor is the output tensor, top_k_cls is the vector defined above, and top_k is the number of top-scoring classes to keep — 5 in this example.

  2. In the implementation, a priority queue is defined first. Its comparator is std::greater, meaning the smaller the value, the higher the priority (a min-heap). Before using the queue, the code determines whether the model’s output data needs to be dequantized, and if so, whether the quantization type is shift or scale.

  3. The floating-point score is the final score of a given class. When a score is computed, the class and its score are pushed into the priority queue. If the number of elements in the queue then exceeds top_k (i.e. reaches 6), the element at the top of the queue — necessarily the one with the lowest score among the six — is popped. After a full pass of this loop, the priority queue holds only the five highest-scoring classes.

  4. Finally, the data in the queue is moved into the vector top_k_cls in descending score order for printing, which completes the TopK post-processing.
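The steps above can be condensed into a standalone sketch on plain float scores (top_k_ids is a hypothetical helper; the real code additionally applies the shift/scale dequantization shown earlier):

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Standalone TopK: keep a min-heap of size k, so any push beyond k
// evicts the current minimum and the heap always holds the k largest
// scores seen so far. Returns class ids, highest score first.
std::vector<int> top_k_ids(const std::vector<float> &scores, size_t k) {
  using Entry = std::pair<float, int>;  // (score, class id)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> q;
  for (int i = 0; i < static_cast<int>(scores.size()); ++i) {
    q.push({scores[i], i});
    if (q.size() > k) q.pop();  // drop the current lowest score
  }
  std::vector<int> ids;
  while (!q.empty()) {
    ids.push_back(q.top().second);  // popped in ascending score order
    q.pop();
  }
  std::reverse(ids.begin(), ids.end());  // highest score first
  return ids;
}
```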

  // Step6: release resources
  {
    // release task handle
    HB_CHECK_SUCCESS(hbDNNReleaseTask(task_handle), "hbDNNReleaseTask failed");
    // free input mem
    for (int i = 0; i < input_count; i++) {
      HB_CHECK_SUCCESS(hbSysFreeMem(&(input_tensors[i].sysMem[0])),
                       "hbSysFreeMem failed");
    }
    // free output mem
    for (int i = 0; i < output_count; i++) {
      HB_CHECK_SUCCESS(hbSysFreeMem(&(output_tensors[i].sysMem[0])),
                       "hbSysFreeMem failed");
    }
    // release model
    HB_CHECK_SUCCESS(hbDNNRelease(packed_dnn_handle), "hbDNNRelease failed");
  }

The last step of the main function, Step6, is cleanup: it releases the task handle, frees the memory allocated for the inputs and outputs, and finally releases the model handle. This completes the walkthrough of the quick start code.

5 Running on the board

The 00_quick_start example can be run in two ways: the first uses the simulator on the x86 side, the second runs directly on the board. The following takes the J5 chip as an example; the steps for XJ3 are similar.

  1. To simulate on x86: first run the build_x86.sh script in the code folder; when it finishes, the executable and related dependencies are generated in the j5 folder. Then enter the j5/script_x86/00_quick_start directory and run the run_mobilenetV1.sh script. This runs the example in the x86 simulator, performs the model’s forward inference, and prints the Top5 classification results.
  2. To run on the board: first run the build_j5.sh script in the code folder; when it finishes, the files and related dependencies are generated in the j5 folder. Copy the entire j5 folder to the board, enter the j5/script/00_quick_start directory, and run the run_mobilenetV1.sh script to run the quick start example on your development board.

Running this quick start example on the J5 development board prints the following terminal output:

I0000 00:00:00.000000 10765 vlog_is_on.cc:197] RAW: Set VLOG level for "*" to 3[BPU_PLAT]BPU Platform Version(1.3.3)!
[HBRT] set log level as 0. version = 3.15.18.0
[DNN] Runtime version = 1.17.2_(3.15.18 HBRT)[A][DNN][packed_model.cpp:225][Model](2023-04-11,17:51:17.206.804) [HorizonRT] The model builder version = 1.15.0
I0411 17:51:17.244180 10765 run_mobileNetV1_224x224.cc:135] DNN runtime version: 1.17.2_(3.15.18 HBRT)
I0411 17:51:17.244376 10765 run_mobileNetV1_224x224.cc:252] input[0] name is data
I0411 17:51:17.244508 10765 run_mobileNetV1_224x224.cc:268] output[0] name is prob
I0411 17:51:17.260176 10765 run_mobileNetV1_224x224.cc:159] read image to tensor as nv12 success
I0411 17:51:17.262075 10765 run_mobileNetV1_224x224.cc:194] TOP 0 result id: 340
I0411 17:51:17.262118 10765 run_mobileNetV1_224x224.cc:194] TOP 1 result id: 292
I0411 17:51:17.262148 10765 run_mobileNetV1_224x224.cc:194] TOP 2 result id: 282
I0411 17:51:17.262177 10765 run_mobileNetV1_224x224.cc:194] TOP 3 result id: 83
I0411 17:51:17.262205 10765 run_mobileNetV1_224x224.cc:194] TOP 4 result id: 290