1: Process overview
To begin, this article walks you through the whole pipeline, **typing every command yourself, from environment setup through model conversion** to deployment on the board. On the one hand, this overview lets you get through the whole flow as quickly as possible (the technical details of each step are fully covered in the later sections). On the other hand, although this is a bit of a rush, the author has believed for many years that hands-on practice beats close reading of documentation: the best learning path is to run through the whole process once, then go back through the details one by one (a personal opinion).
Step-1: Environment preparation for model transformation
# This guide assumes WSL2 Ubuntu 20.04 or any native Ubuntu system
# 0. Install the WeNet environment (10min)
# 0.1 For the convenience of users in mainland China, WeNet's mirror repository is synchronized on Gitee to speed up git clone
# 0.2 Model training is not involved here, so install the CPU version of torch directly via pip instead of the conda install officially recommended by WeNet
conda create -n horizonbpu python=3.8
conda activate horizonbpu
git clone https://gitee.com/xcsong-thu/wenet.git
cd wenet/runtime/horizonbpu
pip install -r ../../requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip install torch==1.13.0 torchaudio==0.13.0 torchvision==0.14.0 onnx onnxruntime -i https://mirrors.aliyun.com/pypi/simple
# 1. Install the Horizon Model Conversion Kit and its dependencies (1min)
wget https://gitee.com/xcsong-thu/toolchain_pkg/releases/download/resource/wheels.tar.gz
tar -xzf wheels.tar.gz
pip install wheels/* -i https://mirrors.aliyun.com/pypi/simple
# 2. Install the cross-compilation tools (1min) (WSL2 is recommended; requires sudo)
sudo apt-get install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
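Before moving on, it can be worth checking that the tools installed above are actually visible on PATH. A minimal stdlib sketch (check_tool is a hypothetical helper for this article, not part of WeNet):

```python
import shutil


def check_tool(name: str) -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None


# Tools installed in Step-1; adjust the list to your setup.
required = ["aarch64-linux-gnu-gcc", "aarch64-linux-gnu-g++", "cmake"]
missing = [t for t in required if not check_tool(t)]
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All cross-compilation tools found.")
```

Running this inside the horizonbpu conda environment should report nothing missing if Step-1 succeeded.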
Step-2: C++ Demo compilation
# NOTE: Assumes you are in the wenet/runtime/horizonbpu directory (make sure the build folder path below is correct)
# 0. Compile the main program decoder_main (20min; downloads gflags/glog from GitHub, which may require a proxy in mainland China)
# Compilation uses the aarch64-linux-gnu-gcc and aarch64-linux-gnu-g++ installed in Step-1; their location can be found via whereis aarch64-linux-gnu-g++
cmake -B build -DBPU=ON -DONNX=OFF -DTORCH=OFF -DWEBSOCKET=OFF -DGRPC=OFF -DCMAKE_TOOLCHAIN_FILE=toolchains/aarch64-linux-gnu.toolchain.cmake
cmake --build build
# 1. No need to wait for the cross-compilation to finish before starting Step-3: the C++ demo build and the model conversion are independent and can run in parallel, so feel free to move straight on to Step-3
Step-3: Officially start model conversion
# NOTE: Assumes you are in the wenet/runtime/horizonbpu directory (make sure the relative paths in the exports below are correct)
# 0. Configure the path (1min)
conda activate horizonbpu
export WENET_DIR=$PWD/../../
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=$WENET_DIR:$PYTHONPATH
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
# 1. Download the torch floating-point model (3min)
wget https://ghproxy.com/https://github.com/xingchensong/toolchain_pkg/releases/download/conformer_subsample8_110M/model_subsample8_parameter110M.tar.gz
tar -xzf model_subsample8_parameter110M.tar.gz
# 2. Execute the transformation, pytorch model -> onnx model -> bin model (~40min)
# Because the model is large (110M parameters), the conversion needs a lot of memory (at least 16G).
# If you don't have enough memory, or don't want to wait 40 minutes, you can download precompiled encoder.bin/ctc.bin files (chunk_size=8, left_chunks=16 as an example) from the link below:
# https://github.com/xingchensong/toolchain_pkg/releases/tag/model_converted_chunksize8_leftchunk16
python3 $WENET_DIR/tools/onnx2horizonbin.py \
--config ./model_subsample8_parameter110M/train.yaml \
--checkpoint ./model_subsample8_parameter110M/final.pt \
--output_dir ./model_subsample8_parameter110M/sample50_chunk8_leftchunk16 \
--chunk_size 8 \
--num_decoding_left_chunks 16 \
--max_samples 50 \
--dict ./model_subsample8_parameter110M/units.txt \
--cali_datalist ./model_subsample8_parameter110M/calibration_data/data.list
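The calibration samples consumed by the toolchain are raw float32 buffers on disk (the demo script in the technical-details chapter writes them with numpy's tofile). A torch-free sketch of producing one such .bin file plus an index file; the exact data.list layout expected by onnx2horizonbin.py is an assumption here:

```python
import array
import os
import random


def write_calibration_sample(path: str, num_floats: int) -> int:
    """Write num_floats random float32 values as a raw binary file
    and return the resulting file size in bytes."""
    buf = array.array("f", (random.random() for _ in range(num_floats)))
    with open(path, "wb") as f:
        buf.tofile(f)
    return os.path.getsize(path)


os.makedirs("calibration_data", exist_ok=True)
# One 1x30x30x10 sample, matching the demo input shape used later in this article.
size = write_calibration_sample("calibration_data/0.bin", 1 * 30 * 30 * 10)
# Hypothetical index file; the real data.list layout may differ.
with open("calibration_data/data.list", "w") as f:
    f.write("calibration_data/0.bin\n")
```

Real calibration data should of course come from genuine acoustic features, not random numbers; this only illustrates the on-disk format.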
Step-4: On-board deployment
# NOTE: Assumes you are in the wenet/runtime/horizonbpu directory (make sure the fc_base paths below are correct)
# 0. Upload the cross-compiled artifacts [main program decoder_main] and [required dynamic libraries] to the board (1min)
export BPUIP=xxx.xxx.xxx
export DEMO_PATH_ON_BOARD=/path/to/demo
scp build/bin/decoder_main sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
scp fc_base/easy_dnn-src/dnn/*j3*/*/*/lib/libdnn.so sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
scp fc_base/easy_dnn-src/easy_dnn/*j3*/*/*/lib/libeasy_dnn.so sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
scp fc_base/easy_dnn-src/hlog/*j3*/*/*/lib/libhlog.so sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
# 1. Upload the model conversion products [encoder.bin] and [ctc.bin] to the board, along with a sample audio file and the model dictionary (2min)
scp ./model_subsample8_parameter110M/sample50_chunk8_leftchunk16/hb_makertbin_output_encoder/encoder.bin sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
scp ./model_subsample8_parameter110M/sample50_chunk8_leftchunk16/hb_makertbin_output_ctc/ctc.bin sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
scp ./model_subsample8_parameter110M/test_wav.wav sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
scp ./model_subsample8_parameter110M/units.txt sunrise@$BPUIP:$DEMO_PATH_ON_BOARD
# 2. Log in to x3pi and test
ssh sunrise@$BPUIP
########## The following commands are executed on the X3PI ##########
cd /path/to/demo
sudo LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH \
GLOG_logtostderr=1 GLOG_v=2 \
./decoder_main \
--chunk_size 8 \
--num_left_chunks 16 \
--rescoring_weight 0.0 \
--wav_path ./test_wav.wav \
--bpu_model_dir ./ \
--unit_path ./units.txt 2>&1 | tee log.txt
- Example log output of model conversion process: see attachment hb_mapper_makertbin(1).log
Decoder_main example
2: Technical details
Step-1: Environment preparation for model transformation
The environment preparation itself is unremarkable. What the author wants to highlight here is the dramatic experience improvement brought by upgrading the pytorch version, together with an analysis of the **accuracy bottleneck** and the **speed bottleneck**.
In the official installation package provided by the Horizon developer community, pytorch 1.10.0 is installed for compatibility with the training algorithm package HAT. The pytorch version itself has no effect on the accuracy of the model conversion, but the onnx graphs exported by different pytorch versions differ significantly in node (op) naming. As an example, run the following script with python (assuming you have set up the conda environment from Chapter 1; remember to conda activate horizonbpu):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright [2022-10-28] <sxc19@mails.tsinghua.edu.cn, Xingchen Song>
import os

import torch

config = """
model_parameters:
  onnx_model: './demo.onnx'
  march: 'bernoulli2'
  output_model_file_prefix: 'demo'
  working_dir: './hb_makertbin_output'
  layer_out_dump: False
  log_level: 'debug'
input_parameters:
  input_name: 'in'
  input_type_train: 'featuremap;'
  input_layout_train: 'NCHW;'
  input_shape: '1x30x30x10;'
  # input_batch: 1
  norm_type: 'no_preprocess;'
  # mean_value: ''
  # scale_value: ''
  input_type_rt: 'featuremap;'
  input_space_and_range: ''
  input_layout_rt: 'NCHW;'
calibration_parameters:
  cal_data_dir: './calibration_data/'
  preprocess_on: False
  calibration_type: 'default'
  max_percentile: 1.0
  # run_on_cpu: ''
  # run_on_bpu: ''
compiler_parameters:
  compile_mode: 'latency'
  debug: False
  core_num: 1
  optimize_level: 'O3'
"""
with open("./config_onnx2bin.yaml", "w") as f:
    f.write(config)


def to_numpy(tensor):
    if tensor.requires_grad:
        return tensor.detach().cpu().numpy()
    else:
        return tensor.cpu().numpy()


class SubLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.nn = torch.nn.Conv2d(30, 30, 1, 1)
        self.act = torch.nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.nn(x))


class Layer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sublayer1 = SubLayer()
        self.sublayer2 = SubLayer()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sublayer2(self.sublayer1(x))


# NOTE(xcsong): four-level hierarchy: Model -> Layer -> SubLayer -> OP
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = Layer()
        self.layer2 = Layer()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer2(self.layer1(x))


input_data = torch.randn(1, 30, 30, 10)
demo = Model()
torch.onnx.export(
    demo, input_data, "./demo.onnx", opset_version=11,
    input_names=['in'], output_names=['out']
)
os.makedirs("./calibration_data", exist_ok=True)
to_numpy(input_data).tofile("./calibration_data/0.bin")
os.system(
    "hb_mapper makertbin \
    --model-type \"onnx\" \
    --config \"./config_onnx2bin.yaml\""
)
Step-2: C++ Demo compilation
The on-board inference relies on Horizon's easy_dnn library. The core of the C++ demo is a BPUAsrModel class that wraps the two converted models (encoder.bin and ctc.bin) and drives them through the easy_dnn ModelManager / TaskManager APIs:
using hobot::easy_dnn::Model;
using hobot::easy_dnn::DNNTensor;
using hobot::easy_dnn::TaskManager;
using hobot::easy_dnn::ModelManager;

class BPUAsrModel : public AsrModel {
 public:
  BPUAsrModel() = default;
  ~BPUAsrModel();
  BPUAsrModel(const BPUAsrModel& other);
  void Read(const std::string& model_dir);
  void PrepareEncoderInput(const std::vector<std::vector<float>>& chunk_feats);
  // other member functions ...

 protected:
  void ForwardEncoderFunc(const std::vector<std::vector<float>>& chunk_feats,
                          std::vector<std::vector<float>>* ctc_prob) override;

 private:
  // models
  std::shared_ptr<Model> encoder_model_ = nullptr;
  std::shared_ptr<Model> ctc_model_ = nullptr;
  // input/output tensors; vectors make it easy to handle models with multiple inputs
  std::vector<std::shared_ptr<DNNTensor>> encoder_input_, encoder_output_;
  std::vector<std::shared_ptr<DNNTensor>> ctc_input_, ctc_output_;
  // other member variables ...
};
void BPUAsrModel::Read(const std::string& model_dir) {
  std::string encoder_model_path = model_dir + "/encoder.bin";
  std::string ctc_model_path = model_dir + "/ctc.bin";
  ModelManager* model_manager = ModelManager::GetInstance();
  std::vector<Model*> models;
  model_manager->Load(models, encoder_model_path);
  encoder_model_.reset(model_manager->GetModel([](Model* model) {
    return model->GetName().find("encoder") != std::string::npos;
  }));
  model_manager->Load(models, ctc_model_path);
  ctc_model_.reset(model_manager->GetModel([](Model* model) {
    return model->GetName().find("ctc") != std::string::npos;
  }));
}

void BPUAsrModel::PrepareEncoderInput(...) {
  for (auto& single_input : encoder_input_) {
    auto feat_ptr = reinterpret_cast<float*>(single_input->sysMem[0].virAddr);
    memset(single_input->sysMem[0].virAddr, 0,
           single_input->properties.alignedByteSize);
    // ... copy chunk_feats into feat_ptr (elided) ...
  }
}

void BPUAsrModel::ForwardEncoderFunc(...) {
  TaskManager* task_manager = TaskManager::GetInstance();
  // 1. Forward encoder.bin
  PrepareEncoderInput(chunk_feats);
  for (auto& tensor : encoder_input_) {
    hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_CLEAN);
  }
  auto infer_task = task_manager->GetModelInferTask(1000);
  infer_task->SetModel(encoder_model_.get());
  infer_task->SetInputTensors(encoder_input_);
  infer_task->SetOutputTensors(encoder_output_);
  infer_task->RunInfer();
  infer_task->WaitInferDone(1000);
  infer_task.reset();
  for (auto& tensor : encoder_output_) {
    hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_INVALIDATE);
  }
  // 2. Forward ctc.bin
  PrepareCtcInput();
  for (auto& tensor : ctc_input_) {
    hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_CLEAN);
  }
  infer_task = task_manager->GetModelInferTask(1000);
  infer_task->SetModel(ctc_model_.get());
  infer_task->SetInputTensors(ctc_input_);
  infer_task->SetOutputTensors(ctc_output_);
  infer_task->RunInfer();
  infer_task->WaitInferDone(1000);
  infer_task.reset();
  for (auto& tensor : ctc_output_) {
    hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_INVALIDATE);
  }
}
Step-3: Officially start model conversion
1. Rewrite the Transformer model with one line of code
Using the toolchain to convert a native Transformer model from the NLP domain can be a very poor experience (the conversion may even fail outright). This is because NLP Transformers usually take 2-D or 3-D input tensors of mixed float and long types, whereas the XJ3 chip was designed with vision tasks in mind: inputs are typically 4-D floating-point images, and the toolchain only optimizes the experience for this kind of vision model.
So, to convert an NLP-style Transformer, do we need to retrain a model with a 4-D data flow from scratch? Obviously not. By means of **equivalent replacement** and **abstract encapsulation**, this article converts a native Transformer into a BPU-friendly Transformer with a single line of code:
Encoder4D = wenet.bin.export_onnx_bpu.BPUTransformerEncoder(Encoder3D)
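The layout change this rewrite revolves around is going from a 3-D (batch, time, dim) tensor to a 4-D (batch, dim, 1, time) tensor. The same transform that the torch code in this article expresses as transpose(1, 2).unsqueeze(2) can be sketched in plain Python, purely for illustration:

```python
def to_bpu_layout(x):
    """(B, T, D) nested lists -> (B, D, 1, T).

    Pure-Python mirror of tensor.transpose(1, 2).unsqueeze(2):
    the feature dimension becomes the channel axis, a dummy
    height axis of size 1 is inserted, and time becomes width."""
    B, T, D = len(x), len(x[0]), len(x[0][0])
    return [[[[x[b][t][d] for t in range(T)]]  # innermost list: length-T time axis
             for d in range(D)]                # channel axis moved ahead of time
            for b in range(B)]


x = [[[1, 2], [3, 4], [5, 6]]]  # (B=1, T=3, D=2)
y = to_bpu_layout(x)            # (B=1, D=2, 1, T=3)
```

With this layout, every "token" becomes a pixel column of a 1-pixel-high image, which is exactly the shape the vision-oriented toolchain expects.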
Taking BPULinear as an example (to_numpy is the helper defined in the Step-1 demo script):
import copy

import numpy as np
import torch


class BPULinear(torch.nn.Module):
    """Refactor torch.nn.Linear or pointwise_conv."""

    def __init__(self, module):
        super().__init__()
        # Unchanged submodules and attributes
        original = copy.deepcopy(module)
        self.idim = module.weight.size(1)
        self.odim = module.weight.size(0)
        # Modify weight & bias
        self.linear = torch.nn.Conv2d(self.idim, self.odim, 1, 1)
        # (odim, idim) -> (odim, idim, 1, 1)
        self.linear.weight = torch.nn.Parameter(
            module.weight.unsqueeze(2).unsqueeze(3))
        self.linear.bias = module.bias
        self.check_equal(original)

    def check_equal(self, module):
        random_data = torch.randn(1, 8, self.idim)
        original_result = module(random_data)
        random_data = random_data.transpose(1, 2).unsqueeze(2)
        new_result = self.forward(random_data)
        np.testing.assert_allclose(
            to_numpy(original_result),
            to_numpy(new_result.squeeze(2).transpose(1, 2)),
            rtol=1e-02, atol=1e-03)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Linear with 4-D dataflow.

        Args:
            x (torch.Tensor): (batch, in_channel, 1, time)
        Returns:
            (torch.Tensor): (batch, out_channel, 1, time).
        """
        return self.linear(x)


OrigLinear = torch.nn.Linear(10, 10)
NewLinear = BPULinear(OrigLinear)
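The equivalence BPULinear relies on, namely that a Linear layer is exactly a 1x1 Conv2d with its weight reshaped to (odim, idim, 1, 1), can also be checked without torch. A small pure-Python sketch, for illustration only:

```python
import random


def linear(x, W, b):
    """Plain Linear: x is (T, idim), W is (odim, idim), b is (odim,)."""
    return [[sum(xi * wij for xi, wij in zip(row, W[o])) + b[o]
             for o in range(len(W))] for row in x]


def conv1x1(x4, W, b):
    """1x1 'convolution' over layout (idim, 1, T) -> (odim, 1, T):
    a pointwise mix of channels at each time step, reusing the same W and b."""
    idim, T = len(x4), len(x4[0][0])
    return [[[sum(W[o][i] * x4[i][0][t] for i in range(idim)) + b[o]
              for t in range(T)]] for o in range(len(W))]


random.seed(0)
idim, odim, T = 4, 3, 5
W = [[random.uniform(-1, 1) for _ in range(idim)] for _ in range(odim)]
b = [random.uniform(-1, 1) for _ in range(odim)]
x = [[random.uniform(-1, 1) for _ in range(idim)] for _ in range(T)]

y_lin = linear(x, W, b)                                     # (T, odim)
x4 = [[[x[t][i] for t in range(T)]] for i in range(idim)]   # (idim, 1, T)
y_conv = conv1x1(x4, W, b)                                  # (odim, 1, T)

max_diff = max(abs(y_conv[o][0][t] - y_lin[t][o])
               for o in range(odim) for t in range(T))
```

Both paths compute sum_i W[o][i] * x[t][i] + b[o], so the results agree up to floating-point noise, which is exactly why BPULinear's check_equal passes.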
Building on the successful rewrite of Linear, we can go one step further and rewrite the most common layer in a Transformer, the FeedForward (FFN) layer:
class BPUFFN(torch.nn.Module):
    """Refactor wenet/transformer/positionwise_feed_forward.py::PositionwiseFeedForward
    """

    def __init__(self, module):
        super().__init__()
        # Unchanged submodules and attributes
        original = copy.deepcopy(module)
        self.activation = module.activation
        # 1. Modify self.w_x
        self.w_1 = BPULinear(module.w_1)
        self.w_2 = BPULinear(module.w_2)
        self.check_equal(original)

    def check_equal(self, module):
        random_data = torch.randn(1, 8, self.w_1.idim)
        original_out = module(random_data)
        random_data = random_data.transpose(1, 2).unsqueeze(2)
        new_out = self.forward(random_data)
        np.testing.assert_allclose(
            to_numpy(original_out),
            to_numpy(new_out.squeeze(2).transpose(1, 2)),
            rtol=1e-02, atol=1e-03)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward function.

        Args:
            xs: input tensor (B, D, 1, L)
        Returns:
            output tensor, (B, D, 1, L)
        """
        return self.w_2(self.activation(self.w_1(x)))
The rewrites of the other layers (e.g. MultiHeadAttention) follow the same pattern and are not repeated here; interested readers can study the open-source code in detail.
2. One command to run the whole conversion pipeline
A complete conversion from a pytorch model to a bpu model generally goes through four steps:
- convert the pytorch model to an onnx model
- build the calibration data
- build config.yaml
- invoke hb_mapper to turn the onnx model into a bpu bin
In WeNet's open-source code, we glue these four steps together with good old python; the single command below runs the whole pipeline.
python3 $WENET_DIR/tools/onnx2horizonbin.py \
--config ./model_subsample8_parameter110M/train.yaml \
--checkpoint ./model_subsample8_parameter110M/final.pt \
--output_dir ./model_subsample8_parameter110M/sample50_chunk8_leftchunk16 \
--chunk_size 8 \
--num_decoding_left_chunks 16 \
--max_samples 50 \
--dict ./model_subsample8_parameter110M/units.txt \
--cali_datalist ./model_subsample8_parameter110M/calibration_data/data.list
where:
- config (the model configuration: number of layers, etc.)
- checkpoint (the pytorch floating-point model)
- output_dir (output directory for the .bin files)
- chunk_size (decoding parameter affecting recognition)
- num_decoding_left_chunks (decoding parameter affecting recognition)
- max_samples (how many utterances are used to build the calibration data)
- dict (the dictionary)
- cali_datalist (describes where the calibration data is located)
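The flag set above can be sketched with argparse. This is not the real argument parser of onnx2horizonbin.py; flag names come from the command shown earlier, while types and defaults here are assumptions for illustration:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the conversion flags; types/defaults are assumptions."""
    p = argparse.ArgumentParser(description="pytorch -> onnx -> bpu bin conversion")
    p.add_argument("--config", required=True, help="model config (train.yaml)")
    p.add_argument("--checkpoint", required=True, help="pytorch float model")
    p.add_argument("--output_dir", required=True, help="output dir for .bin files")
    p.add_argument("--chunk_size", type=int, default=8)
    p.add_argument("--num_decoding_left_chunks", type=int, default=16)
    p.add_argument("--max_samples", type=int, default=50)
    p.add_argument("--dict", dest="dict_path", required=True, help="unit dictionary")
    p.add_argument("--cali_datalist", required=True, help="calibration data index")
    return p


args = build_parser().parse_args([
    "--config", "train.yaml", "--checkpoint", "final.pt",
    "--output_dir", "out", "--chunk_size", "8",
    "--dict", "units.txt", "--cali_datalist", "data.list",
])
```

Flags not given on the command line fall back to their defaults, which is why chunk_size and num_decoding_left_chunks are the only decoding parameters you normally need to touch.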
More concretely, in a pythonic spirit, onnx2horizonbin.py turns the construction of config.yaml into a template and instantiates that template from python code by filling in fields:
def generate_config(enc_session, ctc_session, args):
    # The template string below is a templated config.yaml;
    # fields are filled in via .format, which makes DIY much easier.
    template = """
# model parameters
model_parameters:
  # original onnx floating-point model file
  onnx_model: '{}'
  # target AI chip architecture
  march: 'bernoulli2'
  # filename prefix of the converted model that will run on the board
  output_model_file_prefix: '{}'
  # output directory of the conversion results
  working_dir: '{}'
  # whether the converted hybrid model keeps the ability to dump per-layer intermediate results
  layer_out_dump: False
  # log level of the conversion process
  log_level: 'debug'
# input parameters
input_parameters:
  # input node names of the original floating-point model
  input_name: '{}'
  # input data type of the original model (count/order consistent with input_name)
  input_type_train: '{}'
  # input data layout of the original model (count/order consistent with input_name)
  input_layout_train: '{}'
  # input shape of the original model
  input_shape: '{}'
  # batch_size fed to the network at runtime, default 1
  # input_batch: 1
  # input preprocessing method added inside the model
  norm_type: '{}'
  # mean subtracted by the preprocessing; per-channel values must be space-separated
  # mean_value: ''
  # scale applied by the preprocessing; per-channel values must be space-separated
  # scale_value: ''
  # input data type the converted hybrid model should accept (count/order consistent with input_name)
  input_type_rt: '{}'
  # special color space/range of the input data
  input_space_and_range: ''
  # input data layout the converted hybrid model should accept (count/order consistent with input_name)
  input_layout_rt: '{}'
# calibration parameters
calibration_parameters:
  # directory holding the calibration samples
  cal_data_dir: '{}'
  # enable automatic preprocessing of image calibration samples (skimage read + resize to input size)
  preprocess_on: False
  # calibration algorithm
  calibration_type: '{}'
  # parameter of the max calibration method
  max_percentile: 1.0
  # force these OPs to run on the CPU
  run_on_cpu: '{}'
  # force these OPs to run on the BPU
  run_on_bpu: '{}'
# compiler parameters
compiler_parameters:
  # compilation strategy
  compile_mode: 'latency'
  # whether to emit debug info during compilation
  debug: False
  # number of cores the model runs on
  core_num: 1
  # optimization level of model compilation
  optimize_level: 'O3'
"""
    output_dir = os.path.realpath(args.output_dir)
    cal_data_dir = os.path.join(output_dir, 'cal_data_dir')
    os.makedirs(cal_data_dir, exist_ok=True)
    # During onnx export, attributes such as shape / layout are written into the
    # onnx model's custom_metadata_map, so inferring shapes and the like becomes
    # fully automatic instead of being specified by hand in config.yaml.
    # For multi-input models and models whose shapes change often, this templated
    # approach beats a hard-coded config.yaml: it adapts without any manual edits.
    enc_dic = enc_session.get_modelmeta().custom_metadata_map
    enc_onnx_path = os.path.join(output_dir, 'encoder.onnx')
    enc_log_path = os.path.join(output_dir, 'hb_makertbin_output_encoder')
    enc_cal_data = ";".join(
        [cal_data_dir + "/" + x for x in enc_dic['input_name'].split(';')])
    ctc_dic = ctc_session.get_modelmeta().custom_metadata_map
    ctc_onnx_path = os.path.join(output_dir, 'ctc.onnx')
    ctc_log_path = os.path.join(output_dir, 'hb_makertbin_output_ctc')
    ctc_cal_data = ";".join(
        [cal_data_dir + "/" + x for x in ctc_dic['input_name'].split(';')])
    # Fill the config.yaml template via .format, which makes DIY much easier.
    enc_config = template.format(
        enc_onnx_path, "encoder", enc_log_path,
        enc_dic['input_name'], enc_dic['input_type'],
        enc_dic['input_layout_train'], enc_dic['input_shape'],
        enc_dic['norm_type'], enc_dic['input_type'], enc_dic['input_layout_rt'],
        enc_cal_data, args.calibration_type, args.extra_ops_run_on_cpu, "")
    ctc_config = template.format(
        ctc_onnx_path, "ctc", ctc_log_path,
        ctc_dic['input_name'], ctc_dic['input_type'],
        ctc_dic['input_layout_train'], ctc_dic['input_shape'],
        ctc_dic['norm_type'], ctc_dic['input_type'], ctc_dic['input_layout_rt'],
        ctc_cal_data, "default", "", "")
    # Instantiate the templates and write out the real config.yaml files
    with open(output_dir + "/config_encoder.yaml", "w") as enc_yaml:
        enc_yaml.write(enc_config)
    with open(output_dir + "/config_ctc.yaml", "w") as ctc_yaml:
        ctc_yaml.write(ctc_config)
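The template-fill pattern used by generate_config can be shown in miniature. This toy template is not the real one (which appears in full above); it only demonstrates the design choice of keeping the YAML skeleton as one string and injecting the fields that vary:

```python
# Toy version of the template-fill pattern: one skeleton string,
# instantiated once per model with only the varying fields injected.
mini_template = """model_parameters:
  onnx_model: '{}'
  output_model_file_prefix: '{}'
  working_dir: '{}'
"""

enc_text = mini_template.format(
    "./encoder.onnx", "encoder", "./hb_makertbin_output_encoder")
ctc_text = mini_template.format(
    "./ctc.onnx", "ctc", "./hb_makertbin_output_ctc")
```

Compared with maintaining two hand-written config.yaml files, a single template guarantees the encoder and ctc configs can never drift apart structurally.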
[Image: see the original post]
[Video: demo with chunk_size=8, left_chunks=16]
Server-Client example
Decoder_main example