Testing a ResNet34 voiceprint model on the X3 PI board

With the development of artificial intelligence, the accuracy of voiceprint recognition has steadily improved, and modern models are simple enough to be easy to get started with. In the context of the Internet of Things, voiceprint recognition, as an entry point for voice interaction, has a wide range of applications in security, smart home, smart cockpit, and other scenarios. In this article, we test a ResNet34-based voiceprint model using the Wespeaker toolkit (https://github.com/wenet-e2e/wespeaker; stars are welcome).

Environment configuration

  1. Install the model conversion tool

  2. Train (or download) the speaker_resnet34 model using the Wespeaker CN-Celeb v2 recipe: https://github.com/wenet-e2e/wespeaker/tree/master/examples/cnceleb/v2

  3. Export the ONNX model. When exporting, note that opset must be set to 11, and the names of the model's inputs and outputs must be specified. Since the X3 PI only supports fixed-size input, the input shape must also be fixed; the author uses 2 s of audio as input, i.e. num_frames = 198. Also note that the X3 PI expects four-dimensional (NCHW) input, which should be checked when exporting the ONNX model.

  4. Model check. After exporting the ONNX model, use `hb_mapper checker` to verify that all operators in the model are supported. Run the command:

    hb_mapper checker --model-type onnx \
        --march bernoulli2 \
        --model PATH_TO_ONNX \
        --input-shape feats 1x1x198x80 \
        --output check.log

  5. Model conversion. The official documentation for model conversion is available at https://developer.horizon.ai/api/v1/fileData/doc/ddk_doc/navigation/ai_toolchain/docs_cn/horizon_ai_toolchain_user_guide/model_convertion.html. The conversion process requires calibration data; specifically, calibration data must be prepared for each input of the model. For ResNet34 the input is the Fbank feature, so create a feats folder to hold the calibration data, e.g. feats/0.bin, feats/1.bin, etc., where each .bin file holds the Fbank features of one utterance. Prepare as much calibration data as needed: the more data, the better the calibration result, but the longer the calibration takes. The conversion requires a config file set up as follows:

    model_parameters:
      onnx_model: 'path_of_model.onnx'
      march: 'bernoulli2'
      output_model_file_prefix: 'resnet34'
      working_dir: 'hb_model_output_dir/'
      layer_out_dump: False
      log_level: 'debug'

    input_parameters:
      input_name: "feats"
      input_type_train: 'featuremap'
      input_layout_train: 'NCHW'
      input_shape: '1x1x198x80'
      input_batch: 1
      norm_type: 'no_preprocess'
      mean_value: '103.94 116.78 123.68'
      scale_value: '0.017'
      input_type_rt: 'featuremap'
      input_space_and_range: 'regular'
      input_layout_rt: 'NCHW'

    calibration_parameters:
      cal_data_dir: 'cal_data_198/feats/'
      # preprocess_on: False
      calibration_type: 'default'
      max_percentile: 1.0
      run_on_cpu: "Pow_92;Sqrt_96"
      run_on_bpu: "ReduceMean_90;ReduceMean_93;Add_95"

    compiler_parameters:
      compile_mode: 'latency'
      debug: False
      core_num: 1
      optimize_level: 'O3'

    custom_op:
      custom_op_method: register
      op_register_files: sample_custom.py
      custom_op_dir: ./custom_op

    hb_mapper makertbin --config conf.yaml \
        --model-type onnx

    ================================================================================================================================
    Node ON Subgraph Type Cosine Similarity Threshold

    Squeeze_0 CPU – Reshape
    ReduceMean_1 CPU – ReduceMean 1.000000 –
    Sub_2 CPU – Sub 1.000000 –
    Transpose_3 CPU – Transpose
    Unsqueeze_4 CPU – Reshape
    Conv_5 BPU id(0) HzSQuantizedConv 0.999848 9.839872
    Conv_7 BPU id(0) HzSQuantizedConv 0.998837 2.374663
    Conv_9 BPU id(0) HzSQuantizedConv 0.999373 2.144662
    Conv_12 BPU id(0) HzSQuantizedConv 0.997498 2.672191
    Conv_14 BPU id(0) HzSQuantizedConv 0.999045 1.846580
    Conv_17 BPU id(0) HzSQuantizedConv 0.996736 2.964628
    Conv_19 BPU id(0) HzSQuantizedConv 0.998685 1.834792
    Conv_22 BPU id(0) HzSQuantizedConv 0.998433 3.429708
    Conv_24 BPU id(0) HzSQuantizedConv 0.997778 2.074704
    Conv_25 BPU id(0) HzSQuantizedConv 0.998896 3.429708
    Conv_28 BPU id(0) HzSQuantizedConv 0.997199 2.137684
    Conv_30 BPU id(0) HzSQuantizedConv 0.998810 1.845796
    Conv_33 BPU id(0) HzSQuantizedConv 0.996739 2.429881
    Conv_35 BPU id(0) HzSQuantizedConv 0.998526 1.726871
    Conv_38 BPU id(0) HzSQuantizedConv 0.996185 3.030674
    Conv_40 BPU id(0) HzSQuantizedConv 0.998205 1.853858
    Conv_43 BPU id(0) HzSQuantizedConv 0.998171 3.847180
    Conv_45 BPU id(0) HzSQuantizedConv 0.998297 1.774971
    Conv_46 BPU id(0) HzSQuantizedConv 0.998878 3.847180
    Conv_49 BPU id(0) HzSQuantizedConv 0.997234 2.220137
    Conv_51 BPU id(0) HzSQuantizedConv 0.998745 1.546072
    Conv_54 BPU id(0) HzSQuantizedConv 0.996856 1.983660
    Conv_56 BPU id(0) HzSQuantizedConv 0.998637 1.389055
    Conv_59 BPU id(0) HzSQuantizedConv 0.996533 2.238050
    Conv_61 BPU id(0) HzSQuantizedConv 0.998524 1.433360
    Conv_64 BPU id(0) HzSQuantizedConv 0.996596 2.703087
    Conv_66 BPU id(0) HzSQuantizedConv 0.998297 1.456939
    Conv_69 BPU id(0) HzSQuantizedConv 0.995955 3.479296
    Conv_71 BPU id(0) HzSQuantizedConv 0.997830 1.801756
    Conv_74 BPU id(0) HzSQuantizedConv 0.997351 3.761249
    Conv_76 BPU id(0) HzSQuantizedConv 0.997645 1.968938
    Conv_77 BPU id(0) HzSQuantizedConv 0.997343 3.761249
    Conv_80 BPU id(0) HzSQuantizedConv 0.996631 1.828595
    Conv_82 BPU id(0) HzSQuantizedConv 0.996716 1.301091
    Conv_85 BPU id(0) HzSQuantizedConv 0.995967 1.832776
    Conv_87 BPU id(0) HzSQuantizedConv 0.998240 1.269786
    Relu_89 CPU – Relu 0.995270 –
    ReduceMean_90 CPU – ReduceMean 0.998527 –
    ReduceMean_90_reshape CPU – Reshape
    ReduceMean_91 CPU – ReduceMean 0.998527 –
    Sub_96 CPU – Sub 0.992839 –
    Mul_97 CPU – Mul 0.992666 1.023778
    ReduceMean_98 CPU – ReduceMean 0.997092 –
    ReduceMean_98_reshape CPU – Reshape
    Mul_100 CPU – Mul 0.997092 –
    Div_103 CPU – Div 0.997092 –
    Add_105 CPU – Add 0.997092 –
    Sqrt_106 CPU – Sqrt 0.998121 –
    Flatten_107 CPU – Flatten 0.998527 –
    Flatten_108 CPU – Flatten 0.998121 –
    Concat_109 CPU – Concat 0.998287 –
    Gemm_110_pre_reshape CPU – Reshape
    Gemm_110 BPU id(1) HzSQuantizedConv 0.997718 0.954655
    Gemm_110_NHWC2NCHW_LayoutConvert_Output0_reshape CPU – Reshape
    Sub_111 CPU – Sub 0.997718 –
    2022-10-18 15:05:03,882 INFO The quantify model output:

    Node Cosine Similarity L1 Distance L2 Distance Chebyshev Distance

    Sub_111 0.997718 0.007392 0.000591 0.030291
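The per-utterance calibration .bin files mentioned in step 5 are simply raw float32 dumps of the Fbank features. A minimal sketch, where the feature extraction itself is faked with random data (in practice the real Fbank features of each calibration utterance would be used), and the wav list is hypothetical:

```python
# Sketch: dump per-utterance Fbank features as raw float32 .bin files
# for hb_mapper calibration. The random array below is a placeholder
# for the real (1, 1, 198, 80) Fbank features of each utterance.
import os
import numpy as np

os.makedirs("cal_data_198/feats", exist_ok=True)
wav_list = ["a.wav", "b.wav", "c.wav"]  # hypothetical calibration utterances

for i, wav in enumerate(wav_list):
    # Placeholder: replace with the actual Fbank features of `wav`.
    feats = np.random.randn(1, 1, 198, 80).astype(np.float32)
    # File layout matches cal_data_dir in the config: feats/0.bin, feats/1.bin, ...
    feats.tofile(f"cal_data_198/feats/{i}.bin")
```

Each .bin file must match the input shape declared in the config (1x1x198x80, float32), since `hb_mapper` reads the raw bytes back with exactly that layout.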

  6. Board test. After the conversion, we obtain the corresponding bin model file. The author wrote the inference code in C++ on a Linux machine, compiled it with CMake, and then copied the binary and the required static dependencies to the board to run. The associated runtime C++ code will later be open-sourced to Wespeaker.

Final effect

Input two audio clips and determine whether they come from the same speaker.
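The score reported in the log is the standard cosine similarity between the enrollment embedding and the test embedding. A minimal NumPy sketch; the decision threshold is illustrative, not a value from the article:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings (e.g. 256-dim)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical decision rule: accept as the same speaker above a tuned threshold.
THRESHOLD = 0.5  # illustrative value only; tune on a dev set

def same_speaker(enroll_emb, test_emb):
    return cosine_score(enroll_emb, test_emb) > THRESHOLD
```

With the 0.8165 score shown in the log below, such a rule would report the two clips as the same speaker.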

I20221031 17:23:30.933823  8448 bpu_speaker_model.cc:86] Model_path:speaker_resnet34_xj3.bin
I20221031 17:23:30.933848  8448 asv_api_main.cc:29] Init model ...
I20221031 17:23:30.934250  8448 asv_api_main.cc:35] enroll embedding ...
I20221031 17:23:30.994343  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.025279  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.025346  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.054265  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.054364  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.083324  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.094725  8448 speaker_api.cc:49] over enroll ...
I20221031 17:23:31.095248  8448 asv_api_main.cc:45] compute score ...
I20221031 17:23:31.098816  8448 speaker_api.cc:87] Read enroll embedding ...
I20221031 17:23:31.098871  8448 speaker_api.cc:92] Extracting test embedding ...
I20221031 17:23:31.158486  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.187544  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.187639  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.216475  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.216565  8448 bpu_speaker_model.cc:115] feat_size: 198
I20221031 17:23:31.245415  8448 bpu_speaker_model.cc:134] 256
I20221031 17:23:31.245725  8448 speaker_api.cc:94] 256
I20221031 17:23:31.245760  8448 asv_api_main.cc:49] Cosine socre: 0.8165
I20221031 17:23:31.245805  8448 asv_api_main.cc:51] It's the same speaker!