Testing a ResNet34 voiceprint model on the X3 PI board

With the development of artificial intelligence, the accuracy of voiceprint recognition has steadily improved, and modern models are simple enough to be easy to get started with. In the context of the Internet of Things, voiceprint recognition, as an entry point for voice interaction, has a wide range of applications in security, smart home, smart cockpit, and other scenarios. In this article, we test a ResNet34-based voiceprint model using the Wespeaker toolkit (https://github.com/wenet-e2e/wespeaker; stars are welcome).

Environment configuration

  1. Install the model conversion tool

  2. Train (or download) the speaker_resnet34 model using the Wespeaker CN-Celeb v2 recipe: https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2

  3. Export the ONNX model. When exporting, note that opset must be set to 11, and the names of the model's inputs and outputs must be specified. Since the X3 PI only supports fixed-size input, the input shape must also be fixed; the author uses 2 s of audio as input, i.e. num_frames = 198. Also note that the X3 PI expects four-dimensional (NCHW) input, which should be checked when exporting the ONNX model.

  4. Model check. After exporting the ONNX model, use `hb_mapper checker` to verify that all operators in the model are supported. Run the command:

    hb_mapper checker --model-type onnx \
        --march bernoulli2 \
        --model PATH_TO_ONNX \
        --input-shape feats 1x1x198x80 \
        --output check.log

  5. Model conversion. The official documentation for model conversion is available at https://developer.horizon.ai/api/v1/fileData/doc/ddk_doc/navigation/ai_toolchain/docs_cn/horizon_ai_toolchain_user_guide/model_convertion.html. The conversion process requires calibration data; specifically, calibration data must be prepared for each input of the model. For ResNet34 the input is the Fbank feature, so create a feats folder to hold the calibration data, e.g. feats/0.bin, feats/1.bin, etc., where each .bin file holds the Fbank features of one utterance. Prepare as much calibration data as needed: the more data, the better the calibration result, but the longer the calibration takes. The conversion requires a config file set up as follows:

    model_parameters:
      onnx_model: 'path_of_model.onnx'
      march: 'bernoulli2'
      output_model_file_prefix: 'resnet34'
      working_dir: 'hb_model_output_dir/'
      layer_out_dump: False
      log_level: 'debug'

    input_parameters:
      input_name: "feats"
      input_type_train: 'featuremap'
      input_layout_train: 'NCHW'
      input_shape: '1x1x198x80'
      input_batch: 1
      norm_type: 'no_preprocess'
      mean_value: '103.94 116.78 123.68'
      scale_value: '0.017'
      input_type_rt: 'featuremap'
      input_space_and_range: 'regular'
      input_layout_rt: 'NCHW'

    calibration_parameters:
      cal_data_dir: 'cal_data_198/feats/'
      # preprocess_on: False
      calibration_type: 'default'
      max_percentile: 1.0
      run_on_cpu: "Pow_92;Sqrt_96"
      run_on_bpu: "ReduceMean_90;ReduceMean_93;Add_95"

    compiler_parameters:
      compile_mode: 'latency'
      debug: False
      core_num: 1
      optimize_level: 'O3'

    custom_op:
      custom_op_method: register
      op_register_files: sample_custom.py
      custom_op_dir: ./custom_op

    hb_mapper makertbin --config conf.yaml \
        --model-type onnx

    ================================================================================================================================
    Node ON Subgraph Type Cosine Similarity Threshold

    Squeeze_0 CPU – Reshape
    ReduceMean_1 CPU – ReduceMean 1.000000 –
    Sub_2 CPU – Sub 1.000000 –
    Transpose_3 CPU – Transpose
    Unsqueeze_4 CPU – Reshape
    Conv_5 BPU id(0) HzSQuantizedConv 0.999848 9.839872
    Conv_7 BPU id(0) HzSQuantizedConv 0.998837 2.374663
    Conv_9 BPU id(0) HzSQuantizedConv 0.999373 2.144662
    Conv_12 BPU id(0) HzSQuantizedConv 0.997498 2.672191
    Conv_14 BPU id(0) HzSQuantizedConv 0.999045 1.846580
    Conv_17 BPU id(0) HzSQuantizedConv 0.996736 2.964628
    Conv_19 BPU id(0) HzSQuantizedConv 0.998685 1.834792
    Conv_22 BPU id(0) HzSQuantizedConv 0.998433 3.429708
    Conv_24 BPU id(0) HzSQuantizedConv 0.997778 2.074704
    Conv_25 BPU id(0) HzSQuantizedConv 0.998896 3.429708
    Conv_28 BPU id(0) HzSQuantizedConv 0.997199 2.137684
    Conv_30 BPU id(0) HzSQuantizedConv 0.998810 1.845796
    Conv_33 BPU id(0) HzSQuantizedConv 0.996739 2.429881
    Conv_35 BPU id(0) HzSQuantizedConv 0.998526 1.726871
    Conv_38 BPU id(0) HzSQuantizedConv 0.996185 3.030674
    Conv_40 BPU id(0) HzSQuantizedConv 0.998205 1.853858
    Conv_43 BPU id(0) HzSQuantizedConv 0.998171 3.847180
    Conv_45 BPU id(0) HzSQuantizedConv 0.998297 1.774971
    Conv_46 BPU id(0) HzSQuantizedConv 0.998878 3.847180
    Conv_49 BPU id(0) HzSQuantizedConv 0.997234 2.220137
    Conv_51 BPU id(0) HzSQuantizedConv 0.998745 1.546072
    Conv_54 BPU id(0) HzSQuantizedConv 0.996856 1.983660
    Conv_56 BPU id(0) HzSQuantizedConv 0.998637 1.389055
    Conv_59 BPU id(0) HzSQuantizedConv 0.996533 2.238050
    Conv_61 BPU id(0) HzSQuantizedConv 0.998524 1.433360
    Conv_64 BPU id(0) HzSQuantizedConv 0.996596 2.703087
    Conv_66 BPU id(0) HzSQuantizedConv 0.998297 1.456939
    Conv_69 BPU id(0) HzSQuantizedConv 0.995955 3.479296
    Conv_71 BPU id(0) HzSQuantizedConv 0.997830 1.801756
    Conv_74 BPU id(0) HzSQuantizedConv 0.997351 3.761249
    Conv_76 BPU id(0) HzSQuantizedConv 0.997645 1.968938
    Conv_77 BPU id(0) HzSQuantizedConv 0.997343 3.761249
    Conv_80 BPU id(0) HzSQuantizedConv 0.996631 1.828595
    Conv_82 BPU id(0) HzSQuantizedConv 0.996716 1.301091
    Conv_85 BPU id(0) HzSQuantizedConv 0.995967 1.832776
    Conv_87 BPU id(0) HzSQuantizedConv 0.998240 1.269786
    Relu_89 CPU – Relu 0.995270 –
    ReduceMean_90 CPU – ReduceMean 0.998527 –
    ReduceMean_90_reshape CPU – Reshape
    ReduceMean_91 CPU – ReduceMean 0.998527 –
    Sub_96 CPU – Sub 0.992839 –
    Mul_97 CPU – Mul 0.992666 1.023778
    ReduceMean_98 CPU – ReduceMean 0.997092 –
    ReduceMean_98_reshape CPU – Reshape
    Mul_100 CPU – Mul 0.997092 –
    Div_103 CPU – Div 0.997092 –
    Add_105 CPU – Add 0.997092 –
    Sqrt_106 CPU – Sqrt 0.998121 –
    Flatten_107 CPU – Flatten 0.998527 –
    Flatten_108 CPU – Flatten 0.998121 –
    Concat_109 CPU – Concat 0.998287 –
    Gemm_110_pre_reshape CPU – Reshape
    Gemm_110 BPU id(1) HzSQuantizedConv 0.997718 0.954655
    Gemm_110_NHWC2NCHW_LayoutConvert_Output0_reshape CPU – Reshape
    Sub_111 CPU – Sub 0.997718 –
    2022-10-18 15:05:03,882 INFO The quantify model output:

    Node Cosine Similarity L1 Distance L2 Distance Chebyshev Distance

    Sub_111 0.997718 0.007392 0.000591 0.030291
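The per-utterance calibration .bin files mentioned in step 5 are simply raw float32 dumps of the Fbank features. A minimal sketch, where the feature extraction itself is faked with random data (in practice the real Fbank features of each calibration utterance would be used), and the wav list is hypothetical:

```python
# Sketch: dump per-utterance Fbank features as raw float32 .bin files
# for hb_mapper calibration. The random array below is a placeholder
# for the real (1, 1, 198, 80) Fbank features of each utterance.
import os
import numpy as np

os.makedirs("cal_data_198/feats", exist_ok=True)
wav_list = ["a.wav", "b.wav", "c.wav"]  # hypothetical calibration utterances

for i, wav in enumerate(wav_list):
    # Placeholder: replace with the actual Fbank features of `wav`.
    feats = np.random.randn(1, 1, 198, 80).astype(np.float32)
    # File layout matches cal_data_dir in the config: feats/0.bin, feats/1.bin, ...
    feats.tofile(f"cal_data_198/feats/{i}.bin")
```

Each .bin file must match the input shape declared in the config (1x1x198x80, float32), since `hb_mapper` reads the raw bytes back with exactly that layout.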

  6. Board test. After the conversion, we obtain the corresponding bin model file. The author wrote the inference code in C++ on a Linux machine, compiled it with CMake, and then copied the binary and the required static dependencies to the board to run. The associated runtime C++ code will later be open-sourced to Wespeaker.

Final effect

Input two audio clips and determine whether they come from the same speaker.
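The score reported in the log is the standard cosine similarity between the enrollment embedding and the test embedding. A minimal NumPy sketch; the decision threshold is illustrative, not a value from the article:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings (e.g. 256-dim)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical decision rule: accept as the same speaker above a tuned threshold.
THRESHOLD = 0.5  # illustrative value only; tune on a dev set

def same_speaker(enroll_emb, test_emb):
    return cosine_score(enroll_emb, test_emb) > THRESHOLD
```

With the 0.8165 score shown in the log below, such a rule would report the two clips as the same speaker.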

I20221031 17:23:30.933823  8448 bpu_speaker_model.cc:86] Model_path:speaker_resnet34_xj3.bin
I20221031 17:23:30.933848  8448 asv_api_main.cc:29] Init model ...
I20221031 17:23:30.934250  8448 asv_api_main.cc:35] enroll embedding ...
I20221031 17:23:30.994343  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.025279  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.025346  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.054265  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.054364  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.083324  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.094725  8448 speaker_api.cc:49] over enroll ...
I20221031 17:23:31.095248  8448 asv_api_main.cc:45] compute score ...
I20221031 17:23:31.098816  8448 speaker_api.cc:87] Read enroll embedding ...
I20221031 17:23:31.098871  8448 speaker_api.cc:92] Extracting test embedding ...
I20221031 17:23:31.158486  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.187544  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.187639  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.216475  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.216565  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.245415  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.245725  8448 speaker_api.cc:94] 256
I20221031 17:23:31.245760  8448 asv_api_main.cc:49] Cosine socre: 0.8165
I20221031 17:23:31.245805  8448 asv_api_main.cc:51] It's the same speaker!