Multi-model Priority Scheduling

D-Robotics · October 8, 2023, 9:48am

**Catalog**

1 Preface
2 Key concepts
2.1 Model
2.2 task
2.3 function-call/FC
2.4 Maximum FC continuous execution time
2.5 Priority
3 Scheduling policy overview
4 Scheduling strategy of the on-board prediction library
4.1 Basic rules
4.2 Multi-process scheduling
4.3 The model starts and ends with CPU operators
4.4 There are CPU operators in the middle of the model
5 System software preemption policy
5.1 high and normal queues
5.2 Preemption function environment variable
6 Configuration method summary
6.1 Model compilation
6.2 Code writing
6.3 On-board deployment

1 Preface

When deploying applications on a board, multiple models often run at the same time, and each model needs limited computing resources to complete inference, so competition for computing resources is inevitable. Horizon's on-board prediction library (libDNN) has an efficient scheduling strategy that, in general, schedules multi-model inference reasonably and makes full use of hardware resources without manual intervention from the developer. Sometimes, however, we want the inference tasks of certain models to have a higher execution priority. To give developers finer control over the execution of multiple models, Horizon provides a priority scheduling policy.

It should be emphasized that priority scheduling applies to the BPU resources of the Horizon computing platform. The CPU part of a model is handled by the Linux system itself, and this tutorial focuses on how to schedule multiple models that compute on BPU resources at the same time. In addition, **priority scheduling mainly targets scenarios where multiple models compete for the same BPU resources**; if two models each occupy a different BPU core exclusively, priority scheduling does not need to be considered.

This tutorial first sorts out the key concepts involved in priority scheduling, then introduces the scheduling strategy at the level of the on-board prediction library and the preemption function of the lower-level system software, and finally summarizes the specific configuration methods for implementing priority scheduling.

2 Key concepts

2.1 Model

In this tutorial, "model" refers specifically to the runtime model. "Runtime" refers to embedded application development: the runtime model is a deep learning model for on-board deployment, with the file extension bin or hbm.

2.2 task

A task is the inference task that a model is bound to before inference. To run inference on the board, an inference task must be submitted first, and **each frame of inference corresponds to 1 task**. Two interfaces submit inference tasks, hbDNNInfer() and hbDNNRoiInfer(); both create the inference task from their input parameters, and the latter is dedicated to ROI inference. The hbDNNWaitTaskDone() interface waits for a task to complete or time out, and the hbDNNReleaseTask() interface releases a task. The life cycle of a task starts at hbDNNInfer() or hbDNNRoiInfer() and ends at hbDNNReleaseTask().

A task can be bound to one or more models. Horizon's XJ3 and J5 computing platforms both support multi-model batch inference: invoking multiple small models to predict multiple inputs in a single inference task can effectively improve inference performance. For details, see the community article [multi-model batch inference](https://developer.horizon.cc/forumDetail/174216099150358530).

2.3 function-call/FC

A function-call, abbreviated FC, is the execution granularity of the BPU. When the runtime model (bin/hbm) performs inference on the BPU, the computation takes the form of 1 or more FC invocations. When all the FCs of a model have been executed, execution of the model is complete. It should be emphasized that **FC is a model-level property**, not a task-level property.

2.4 Maximum FC Continuous execution time

This is a compile parameter. In PTQ the parameter is max_time_per_fc, and in QAT it is max-time-per-fc. It specifies the maximum continuous execution time of each FC of the model, in microseconds (μs). Valid values are 0, or 1000 to 4294967295 (2^32-1). The default value is 0, which means no limit.

The parameter can be configured whether the runtime model is compiled with the PTQ or the QAT flow. For PTQ, max_time_per_fc is configured in the compiler_parameters group of the yaml file. For QAT, when compiling the model with the horizon_plugin_pytorch.quantization.compile_model interface, max-time-per-fc can be passed through the additional arguments (extra_args), for example extra_args=['--max-time-per-fc', str(1000)].

If this parameter is not specified or is set to 0, the BPU subgraph of the model has only one FC segment. If max_time_per_fc=1000, the BPU subgraph is divided into multiple FC segments, each with an execution time of 1000 microseconds except the last one, as shown in the following figure.

**The total execution time of the split FC segments is slightly longer than the execution time of 1 complete FC**. The more FC segments a model is split into, the more extra time is added, but typically the extra time needed to run multiple FC segments is no more than 2% of the time needed to run a single complete FC. Note that, like FC itself, the maximum FC continuous execution time is a model-level property.
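The splitting rule above can be illustrated with a minimal sketch. It is an idealized model only: the function name `split_fc` is hypothetical, the per-split overhead (under 2% per the text) is ignored, and the real compiler decides the actual split points.

```python
def split_fc(bpu_time_us: int, max_time_per_fc: int) -> list[int]:
    """Return idealized FC segment durations (microseconds) for one BPU subgraph."""
    if max_time_per_fc == 0:
        # 0 means "no limit": the whole BPU subgraph is a single FC.
        return [bpu_time_us]
    full, rest = divmod(bpu_time_us, max_time_per_fc)
    segments = [max_time_per_fc] * full
    if rest:
        # The last segment holds whatever time is left over.
        segments.append(rest)
    return segments


print(split_fc(3500, 1000))  # [1000, 1000, 1000, 500]
print(split_fc(3500, 0))     # [3500]
```

A smaller max_time_per_fc gives the system software more points at which a higher-priority FC can be scheduled, at the cost of more split overhead.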

2.5 Priority

Before submitting an inference task, configure the control parameters of the inference task. The related parameters are defined in the hbDNNInferCtrlParam structure.

typedef struct {
  int32_t bpuCoreId;
  int32_t dspCoreId;
  int32_t priority;  
  int32_t more;
  int64_t customId;
  int32_t reserved1;  
  int32_t reserved2;
} hbDNNInferCtrlParam;

priority and customId both determine the priority of the task, and priority takes precedence over customId. The value of priority ranges from 0 to 255. **The higher the value of priority, the higher the priority**. When priority=255, the task has a special preemptive priority, and the on-board prediction library and the system software apply more aggressive scheduling policies, which are discussed in more detail later. The customId parameter comes into play when two tasks have the same priority. **The smaller the customId value, the higher the priority**; it is usually assigned a timestamp or frame id.

In addition, bpuCoreId 0 means any BPU core, 1 means BPU core 0, and 2 means BPU core 1. When priority scheduling is a concern, bpuCoreId must be set to 1 or 2. If bpuCoreId is set to 0, you cannot control which BPU core the model runs on.

3 Scheduling policy Overview

Priority scheduling policies exist in two modules, the on-board prediction library and the system software, as shown in the figure above. The on-board prediction library is the inference library that Horizon provides for embedded application development; it deploys the runtime model to the development board and runs it. The system software sits closer to the bottom of the system than the on-board prediction library: it provides the BPU driver and parses the model instructions, and the BPU portion of the model is ultimately executed in the form of FCs. In general, a submitted task enters the queue of the on-board prediction library and is sorted there. After sorting, the BPU part of the model is submitted to the FC queue of the system software and is finally executed in the form of FCs, which completes the priority scheduling process.

4 Scheduling strategy of the on-board prediction library

4.1 Basic Rules

In actual deployment scenarios we want the model to run inference as fast as possible, so it is common to remove the CPU operators at the beginning and end of the model, move the related computation into pre-processing and post-processing, and optimize the model into a single complete BPU subgraph. This section describes that case.

Enqueue

When the deployment code reaches hbDNNInfer or hbDNNRoiInfer, the task is submitted to a queue in the on-board prediction library, which orders the tasks. The tasks in the queue are unprocessed tasks waiting to be executed. Each time a new task enters the queue, the queue is re-sorted by the priority and customId fields of each task's inference control parameters (hbDNNInferCtrlParam). priority is compared first, and a larger value means a higher priority. When priority is equal, customId is compared, and a smaller value means a higher priority.
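The ordering rule can be sketched as a two-level sort key. This is only an illustration of the comparison described above, not the library's implementation; the `Task` class and `sort_queue` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int   # 0..255, larger runs first
    custom_id: int  # smaller runs first when priority is equal


def sort_queue(tasks: list[Task]) -> list[Task]:
    # Sort by priority descending, then customId ascending.
    return sorted(tasks, key=lambda t: (-t.priority, t.custom_id))


queue = [
    Task("taskB", priority=10, custom_id=7),
    Task("taskC", priority=200, custom_id=9),
    Task("taskA", priority=200, custom_id=3),
]
print([t.name for t in sort_queue(queue)])  # ['taskA', 'taskC', 'taskB']
```

taskA and taskC share priority 200, so the smaller customId decides between them, which is why a timestamp or frame id keeps older frames ahead of newer ones.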

Dequeue

Tasks with higher priority are executed first. Suppose there is a higher-priority taskA and a lower-priority taskB, bound to modelA and modelB respectively. Initially only taskB is waiting in the queue, and then taskA is submitted. Because taskA has higher priority than taskB, taskA is processed before taskB, and so modelA is processed before modelB. After taskA leaves the queue, the BPU portion of modelA is converted to FCs and sent to the FC queue of the system software. After taskB leaves the queue, the BPU portion of modelB is likewise converted to FCs and sent to the FC queue of the system software. At the system software level, the two groups of FCs are scheduled further.

4.2 Multi-process scheduling

Section 4.1 describes scheduling within a single process only. This section adds scheduling across multiple processes. The J5 computing platform supports running programs in Service Mode, which optimizes the scheduling policy for multi-process scenarios. When Service Mode is disabled, the on-board prediction library can only schedule tasks within a single process and cannot schedule tasks across processes. If a task is executing in the current process and a higher-priority task appears in another process, the higher-priority task may not run in time, because this information cannot be passed between processes. When Service Mode is enabled, information about the different processes is maintained in shared memory, so tasks from multiple processes can be scheduled in a unified manner and high-priority tasks in any process are executed first. Note: the XJ3 computing platform does not support Service Mode.

4.3 The model starts and ends with CPU operators

If the model has CPU operators on its input side, the Linux system schedules and computes those operators once the task leaves the queue. The CPU computation is not delivered to the system software. After the input-side CPU operators finish, the BPU part is delivered to the system software in the form of FCs.

One situation to be aware of: suppose taskA leaves the queue first and taskB leaves later, but modelA (bound to taskA) has CPU operators on its input side while modelB (bound to taskB) is a pure BPU model. In that case, while modelA's CPU operators are being computed, modelB is converted into FCs first and enters the system software queue first.

CPU operators on the output side of the model are executed only after all FCs of the model's BPU part have been executed in the system software. They are also scheduled and computed by the Linux system.
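The reordering described above follows from a simple timing rule, sketched below. The function name, the tuple layout, and the millisecond figures are illustrative assumptions; the sketch only models the point at which each model's FCs reach the system software.

```python
def fc_submission_order(tasks):
    """tasks: list of (name, dequeue_time_ms, input_cpu_ms) tuples."""
    # A model's FCs reach the system software only after its input-side
    # CPU operators have finished.
    arrival = [(dequeue + cpu, name) for name, dequeue, cpu in tasks]
    return [name for _, name in sorted(arrival)]


# taskA leaves the queue first but spends 5 ms in input-side CPU operators;
# taskB leaves 1 ms later and is a pure BPU model.
print(fc_submission_order([("taskA", 0, 5), ("taskB", 1, 0)]))  # ['taskB', 'taskA']
```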

4.4 There are CPU operators in the middle of the model

Suppose modelA, bound to taskA, has the structure BPU-CPU-BPU, and modelB, bound to taskB, is a pure BPU model. taskA leaves the queue first and taskB later, so modelA executes first while modelB waits. While the CPU operators in the middle of modelA are executing, the BPU is idle, and modelB can optionally be inserted to run so that BPU resources are fully used. This feature is controlled by **three environment variables**; modelB may be inserted as soon as the condition set by **any** one of them is met.

HB_DNN_BPU_SCHEDULE_THRESHOLD

The default value of this environment variable is 10, expressed as a percentage (%). It limits how much running modelB in the gap may affect the next BPU segment of modelA. If the impact is less than the configured value, the insertion is allowed. The impact is calculated as:

(Time by which modelA's second BPU segment is delayed by the insertion) / (Total run time of modelA's second BPU segment) * 100%

This formula can also be rewritten as:

(BPU run time of modelB - CPU run time of modelA) / (Total run time of modelA's second BPU segment) * 100%

For example, suppose the second BPU segment of modelA runs for 100 ms. If inserting modelB makes that segment start 5 ms later than it otherwise would, the impact is 5%. With the environment variable set to 10, the allowed impact is 10%; 5% is less than 10%, so modelB is allowed to run while modelA is in its CPU segment.
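The formula can be checked with a few lines of code. This is a sketch under stated assumptions: the function name and the example figures for modelB's BPU time (25 ms) and modelA's CPU time (20 ms) are made up so that the delay comes out at 5 ms, and clamping a negative delay to zero is an assumption rather than documented behavior.

```python
def impact_percent(model_b_bpu_ms, model_a_cpu_ms, model_a_second_bpu_ms):
    # Delay of modelA's second BPU segment: the part of modelB's BPU time
    # that is not hidden behind modelA's CPU segment (assumed never negative).
    delay_ms = max(0.0, model_b_bpu_ms - model_a_cpu_ms)
    return delay_ms * 100.0 / model_a_second_bpu_ms


impact = impact_percent(model_b_bpu_ms=25, model_a_cpu_ms=20, model_a_second_bpu_ms=100)
print(impact)       # 5.0
print(impact < 10)  # True: below the default threshold of 10
```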

HB_DNN_SCHEDULE_INSERT_FC_MAX_TIME

The default value of this environment variable is 1000, in milliseconds. If the run time of modelB is less than this value, the insertion is allowed.

HB_DNN_SCHEDULE_WAIT_DISPATCH_TIME

The default value of this environment variable is 20, in milliseconds. If the CPU segment of modelA runs longer than this value, modelB is allowed to be inserted.

Note: the on-board prediction library records the running time of the BPU and CPU segments of each model, so the estimated times become more accurate as the number of inferences increases.
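Taken together, the three variables form an "any condition is enough" rule, sketched below. The function is hypothetical; how the library parses the values and whether its comparisons are strict are assumptions. The defaults are the ones given in the text.

```python
import os


def insertion_allowed(impact_percent, model_b_bpu_ms, model_a_cpu_ms, env=os.environ):
    # Defaults are those stated in the text; parsing as float is an assumption.
    threshold = float(env.get("HB_DNN_BPU_SCHEDULE_THRESHOLD", 10))
    insert_max_ms = float(env.get("HB_DNN_SCHEDULE_INSERT_FC_MAX_TIME", 1000))
    wait_ms = float(env.get("HB_DNN_SCHEDULE_WAIT_DISPATCH_TIME", 20))
    # Meeting any one of the three conditions is enough.
    return (
        impact_percent < threshold
        or model_b_bpu_ms < insert_max_ms
        or model_a_cpu_ms > wait_ms
    )


env = {"HB_DNN_SCHEDULE_INSERT_FC_MAX_TIME": "5"}  # tighten one limit for the demo
print(insertion_allowed(50, 30, 10, env))  # False: no condition is met
print(insertion_allowed(50, 30, 25, env))  # True: modelA's CPU segment exceeds 20 ms
```

With the default HB_DNN_SCHEDULE_INSERT_FC_MAX_TIME of 1000, most small models satisfy the second condition on their own, which is why the demo lowers it.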

5 System software preemption policy

At the system software level there is no concept of a model: the BPU part of each model has already been converted into FCs, and the system software can only schedule FCs. The FCs received by the system software are delivered by the on-board prediction library.

5.1 high and normal queues

The system software has two FC queues: **a high queue with preemptive priority and a normal queue with non-preemptive priority**. Only FCs with priority 255 can enter the high queue; FCs with priority between 0 and 254 enter the normal queue. The multiple FC segments split from one model's BPU subgraph are sent to the high or normal queue together, in one batch.

If no FC with priority 255 is received, the high queue stays empty and all FCs with priority below 255 enter the normal queue, where they execute in FIFO order: the FC that joined the queue first executes first, regardless of priority. If the system software then receives FCs with priority 255, they enter the high queue and preempt execution at a specific point. After all FCs in the high queue have executed, the FCs in the normal queue continue. For FCs with preemptive priority, the timing of preemption depends on the system software environment variable BPLAT_CORELIMIT.

5.2 Preemption function environment variable

BPLAT_CORELIMIT is a system-software-level environment variable that controls FC preemption. Its value is 0 or a positive integer. The default is 0, which means FC preemption does not occur. If the value is set to n (n>0), **the longest preemption wait time** is the execution time of the first n FCs in the non-preemptive priority queue.

In the figure above, the three blue non-preemptive FCs in the normal queue come from the same BPU subgraph, and the FC on the left of the normal queue executes first. If BPLAT_CORELIMIT=1, the preemptive-priority FC executes immediately after the first non-preemptive FC finishes. If BPLAT_CORELIMIT=2, preemption happens after the first or the second non-preemptive FC finishes. If BPLAT_CORELIMIT=0, FC preemption does not occur, and FCs execute in the order in which the system software received them. In general, to enable FC preemption at the system software level, set export BPLAT_CORELIMIT=1.
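The wait bound implied by BPLAT_CORELIMIT can be sketched as follows. The function name and the FC durations are illustrative assumptions, and the sketch computes only the upper bound stated in the text, not the scheduler's exact behavior. For the value 0 it assumes all normal FCs were queued before the priority-255 FC arrived.

```python
def worst_case_preempt_wait(normal_fc_us, corelimit):
    """Upper bound (microseconds) on how long a priority-255 FC waits behind the normal queue."""
    if corelimit == 0:
        # No preemption: plain FIFO, so every FC already queued runs first.
        return sum(normal_fc_us)
    # Preemption enabled: at most the first n normal FCs run first.
    return sum(normal_fc_us[:corelimit])


normal_queue = [1000, 1000, 500]  # three FCs from one BPU subgraph
print(worst_case_preempt_wait(normal_queue, 0))  # 2500
print(worst_case_preempt_wait(normal_queue, 1))  # 1000
print(worst_case_preempt_wait(normal_queue, 2))  # 2000
```

This also shows why the low-priority model needs max_time_per_fc set at compile time: if its BPU subgraph were a single long FC, even BPLAT_CORELIMIT=1 would make the high-priority FC wait for the whole subgraph.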

In addition, the following two points need to be noted:

For two tasks that both have priority 255, the FCs of their models cannot preempt each other, even if their customId values differ.
If a model is split into more than two FC segments, it can be preempted multiple times between its start and its end.

6 Configuration method summary

6.1 Model compilation

In the compilation phase, the maximum FC continuous execution time must be configured for the low-priority model, so that the FCs of the high-priority model can preempt it in the system software queue. For the PTQ flow, configure it in the yaml file as follows:

compiler_parameters:
    max_time_per_fc: 1000

For the QAT flow, you can configure it as follows when calling the compile_model interface:

compile_model(
    ...
    extra_args=['--max-time-per-fc',str(1000)]
)

6.2 Code writing

When you write the deployment code, you need to configure the bpuCoreId, priority, and customId (optional) of the model inference control parameters.

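hbDNNInferCtrlParam infer_ctrl_param;
HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);  // set all fields to defaults first
infer_ctrl_param.bpuCoreId = 1;   // 1 = BPU core 0 (0 would mean any core)
infer_ctrl_param.priority = 255;  // 255 = preemptive priority
infer_ctrl_param.customId = 0;    // tie-breaker when priority is equal; smaller runs first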

bpuCoreId 0 means any BPU core, 1 means BPU core 0, and 2 means BPU core 1. Again, when priority scheduling is a concern, bpuCoreId must be configured as 1 or 2; if it is 0, you cannot control which BPU core the model runs on. priority must be configured. customId is optional and takes effect only when two tasks have the same priority.

6.3 On-Board deployment

Before executing the inference program, run the following command in the same shell session that launches it, to set the environment variable that enables the system software preemption function:

export BPLAT_CORELIMIT=1

> If you are confused about priority scheduling or want to enable Service Mode on the J5 computing platform, please contact Horizon Technical Support.