[Sunrise X3] Implement OCR Deployment (DBNet + CRNN)

1. Preface

Text detection and recognition is a classic CV task that mainly consists of two steps: text box detection and character recognition. This article implements three parts on X3 for you to choose from: text detection, character recognition, and text detection + character recognition, and provides both onnxruntime and pytorch deployment code. It is also the first time a grayscale image is used as model input on X3, so the quantization sections include the configuration file and calibration data preparation. Hopefully X3 can prove its value in more areas.

The text detection model used is DBNet: https://github.com/SURFZJY/Real-time-Text-Detection-DBNet

The character recognition model used is CRNN: https://github.com/meijieru/crnn.pytorch

The test code for this article is at https://github.com/Rex-LK/ai_arm_learning

2. DBNet text detection

2.1 Introduction to the model

DBNet is a widely used text detection model. It is essentially a segmentation model that ultimately segments the text regions out of the image. Its model structure is as follows:

dbnet.png
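The "DB" in DBNet stands for differentiable binarization: during training the probability map P and the threshold map T are combined through a soft binarization so the threshold itself can be learned. A minimal numeric sketch of that formula (with the paper's amplification factor k = 50), plus the plain threshold that suffices at inference time:

```python
import numpy as np

def db_binarize(prob_map, thresh_map, k=50.0):
    """Approximate (differentiable) binarization from the DBNet paper:
    B = 1 / (1 + exp(-k * (P - T))), with amplification factor k = 50."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# At inference time, a plain threshold on the probability map is enough:
prob = np.array([[0.1, 0.9], [0.6, 0.2]], dtype=np.float32)
binary = (prob > 0.3).astype(np.uint8)  # 1 where text is likely
print(binary)
```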

2.2 Model quantization

  • Export onnx

The code to export the onnx model is provided in x3/ocr/dbnet/predict.py.

  • Export bin

DBNet's quantization follows the standard flow; the configuration file is as follows:

model_parameters:
  onnx_model: 'dbnet_simp.onnx'
  output_model_file_prefix: 'dbnet_simp'
  march: 'bernoulli2'
input_parameters:
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  # input_type_rt options: 'rgb' / 'nv12' / 'yuv444' / 'bgr'
  input_type_rt: 'nv12'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451
  input_layout_rt: 'NHWC'
calibration_parameters:
  cal_data_dir: './calibration_data_yuv_f32'
  calibration_type: 'max'
  max_percentile: 0.9999
compiler_parameters:
  compile_mode: 'latency'
  optimize_level: 'O3'
  debug: False
  core_num: 2
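The scale_value here is simply 1/255, matching the img / 255 step in the preprocessing code; the CRNN configuration later uses 0.0078125, which is exactly 1/128. A quick check:

```python
# scale_value maps to a simple on-chip normalization:
print(1 / 255)  # DBNet's scale_value: pixels scaled to [0, 1]
print(1 / 128)  # CRNN's scale_value (0.0078125)
```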

2.3 DBNet test code

The onnxruntime test code is x3/ocr/dbnet/infer_onnxruntime.py; the inference code is as follows:

def predict(self, img, d_size=(640, 640), min_area: int = 100):
    img0_h, img0_w = img.shape[:2]
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # image preprocessing: resize, scale to [0, 1], HWC -> NCHW
    img = cv2.resize(img, d_size)
    img = img / 255
    img = img.astype(np.float32)
    img = img.transpose(2, 0, 1)
    img = np.ascontiguousarray(img)
    img = img[None, ...]
    preds = self.model.run(["output"], {"image": img})[0]
    preds = torch.from_numpy(preds[0])
    scale = (preds.shape[2] / img0_w, preds.shape[1] / img0_h)
    start = time.time()
    prob_map, thres_map = preds[0], preds[1]
    out = (prob_map > self.thr).float() * 255
    out = out.data.cpu().numpy().astype(np.uint8)
    contours, hierarchy = cv2.findContours(out, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # np.int was removed in NumPy 1.24; use np.int32 instead
    contours = [(i / scale).astype(np.int32) for i in contours if len(i) >= 4]
    dilated_polys = []
    for poly in contours:
        poly = poly[:, 0, :]
        # formula (10) in the paper: D' = A * r' / L
        D_prime = cv2.contourArea(poly) * self.ratio_prime / cv2.arcLength(poly, True)
        pco = pyclipper.PyclipperOffset()
        pco.AddPath(poly, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        dilated_poly = np.array(pco.Execute(D_prime))
        # skip empty or ragged (multi-path) offset results
        if dilated_poly.size == 0 or not np.issubdtype(dilated_poly.dtype, np.integer) or len(dilated_poly) != 1:
            continue
        dilated_polys.append(dilated_poly)
    boxes_list = []
    for cnt in dilated_polys:
        if cv2.contourArea(cnt) < min_area:
            continue
        rect = cv2.minAreaRect(cnt)
        box = cv2.boxPoints(rect).astype(np.int32)
        boxes_list.append(box)
    t = time.time() - start
    boxes_list = np.array(boxes_list)
    return dilated_polys, boxes_list, t
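The D_prime offset in the code above is formula (10) from the DBNet paper, D' = A * r' / L: each detected polygon is expanded by an amount proportional to its area divided by its perimeter, with expansion ratio r' (1.5 in the paper). A small worked example:

```python
# Worked example of the box-expansion offset (formula (10) in the DBNet paper):
# D' = A * r' / L, where A is the contour area, L its perimeter, r' the ratio.
def unclip_distance(area, perimeter, ratio_prime=1.5):
    return area * ratio_prime / perimeter

# For a 100x100 square: A = 10000, L = 400 -> the polygon is offset by 37.5 px
d = unclip_distance(100 * 100, 4 * 100)
print(d)  # 37.5
```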

2.4 Results

contour.png

The green closed curves are the text box regions detected by DBNet.

3. CRNN character recognition

3.1 Introduction to the model

CRNN is a convolutional recurrent neural network: it combines a CNN and an LSTM to extract image features, then applies CTC decoding to produce the recognized text. The model structure is as follows:

crnn.jpeg

3.2 Model quantization

  • Export onnx

x3/crnn/demo.py provides the code to export the ONNX model. One difference from the other models: CRNN accepts a grayscale image as input, with dimensions [1, 1, 32, 100].

  • Prepare grayscale calibration data

    import os
    import cv2
    import numpy as np

    src_root = '../../../01_common/calibration_data/coco/'
    cal_img_num = 100
    dst_root = 'calibration_data'
    num_count = 0
    img_names = []
    for src_name in sorted(os.listdir(src_root)):
        if num_count >= cal_img_num:  # stop once cal_img_num images are collected
            break
        img_names.append(src_name)
        num_count += 1
    os.makedirs(dst_root, exist_ok=True)

    def imequalresize(img, target_size, pad_value=127.):
        """Resize preserving aspect ratio, padding the borders to target_size."""
        target_w, target_h = target_size
        image_h, image_w = img.shape[:2]
        img_channel = 3 if len(img.shape) > 2 else 1
        scale = min(target_w * 1.0 / image_w, target_h * 1.0 / image_h)
        new_h, new_w = int(scale * image_h), int(scale * image_w)

        resize_image = cv2.resize(img, (new_w, new_h))
        pad_image = np.full(shape=[target_h, target_w, img_channel], fill_value=pad_value)
        dw, dh = (target_w - new_w) // 2, (target_h - new_h) // 2
        pad_image[dh:new_h + dh, dw:new_w + dw, :] = resize_image
        return pad_image

    for each_imgname in img_names:
        img_path = os.path.join(src_root, each_imgname)
        img = cv2.imread(img_path)  # BGR, HWC
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # RGB, HWC
        img = imequalresize(img, (100, 32))
        img = img.astype(np.uint8)
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)  # single-channel gray, 32x100
        dst_path = os.path.join(dst_root, each_imgname + '.rgbchw')
        print("write:%s" % dst_path)
        img.astype(np.uint8).tofile(dst_path)
    print('finish')
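Each file the script writes is just the raw uint8 bytes of a 32x100 grayscale image, with no header. A small round-trip check of that format, using a synthetic image rather than a real calibration file:

```python
import numpy as np

# Round-trip a synthetic 32x100 gray image the same way the script does:
# raw uint8 bytes written with tofile, read back with fromfile and reshaped.
gray = np.random.randint(0, 256, size=(32, 100), dtype=np.uint8)
gray.tofile('check.rgbchw')

raw = np.fromfile('check.rgbchw', dtype=np.uint8)
assert raw.size == 32 * 100  # exactly one 32x100 image, no header
restored = raw.reshape(32, 100)
print(np.array_equal(restored, gray))
```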
    
  • config.yaml

    model_parameters:
      onnx_model: 'crnn_simp.onnx'
      output_model_file_prefix: 'crnn_simp'
      march: 'bernoulli2'
    input_parameters:
      input_type_train: 'gray'
      input_layout_train: 'NCHW'
      # input_type_rt options: 'rgb' / 'nv12' / 'yuv444' / 'bgr'
      input_type_rt: 'gray'
      norm_type: 'data_scale'
      scale_value: 0.0078125
      input_layout_rt: 'NHWC'
    calibration_parameters:
      cal_data_dir: './calibration_data'
      calibration_type: 'max'
      max_percentile: 0.9999
    compiler_parameters:
      compile_mode: 'latency'
      optimize_level: 'O3'
      debug: False
      core_num: 2
    

3.3 crnn test code

The CRNN test file is x3/ocr/crnn/infer_onnxruntime.py; the inference code is as follows:

def predict(self, img):
    # grayscale preprocessing to the fixed 100x32 input size
    image_input = self.preprocess_gray(img, (100, 32))
    preds = self.model.run(["output"], {"image": image_input})[0]
    preds = torch.from_numpy(preds)
    _, preds = preds.max(2)
    preds = preds.transpose(1, 0).contiguous().view(-1)
    # Variable is deprecated; a plain tensor works on modern PyTorch
    preds_size = torch.IntTensor([preds.size(0)])
    raw_pred = self.converter.decode(preds.data, preds_size.data, raw=True)
    sim_pred = self.converter.decode(preds.data, preds_size.data, raw=False)
    return raw_pred, sim_pred
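The preprocess_gray helper is not shown in the article; a plausible sketch, assuming it normalizes the 32x100 gray crop the way crnn.pytorch does ((pixel/255 - 0.5) / 0.5, i.e. scaled to [-1, 1]) and lays it out as NCHW:

```python
import numpy as np

# Hypothetical sketch of the preprocess_gray helper, assuming the crop is already
# resized to 32x100; normalization follows crnn.pytorch's resizeNormalize.
def preprocess_gray(gray_32x100):
    x = gray_32x100.astype(np.float32) / 255.0
    x = (x - 0.5) / 0.5           # -> [-1, 1]
    return x[None, None, ...]     # -> NCHW, shape (1, 1, 32, 100)

inp = preprocess_gray(np.zeros((32, 100), dtype=np.uint8))
print(inp.shape)  # (1, 1, 32, 100)
```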

4. Text detection + character recognition

4.1 Identification Process

In real projects, text detection and character recognition are usually chained together: first the text detection model locates the text regions, then CRNN recognizes the content inside each region. The following simple example implements this pipeline.

4.2 Test Code

Take the X3 test code as an example: x3/ocr/demo_x3.py

dbnet = dbnet_model(args.dbnet_path)
# this repo's CRNN recognizes digits and lowercase letters
alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'
converter = strLabelConverter(alphabet)
crnn = crnn_model(args.crnn_path, converter)
img0 = cv2.imread(args.image_path)
img_rec = img0.copy()
img0_h, img0_w = img0.shape[:2]
# boxes_list holds all detected text box regions
contours, boxes_list, t = dbnet.predict(img0)
for i, box in enumerate(boxes_list):
    mask_t = np.zeros((img0_h, img0_w), dtype=np.uint8)
    # extract a single text region with a polygon mask
    cv2.fillPoly(mask_t, [box], (255), 8, 0)
    pick_img = cv2.bitwise_and(img0, img0, mask=mask_t)
    x, y, w, h = cv2.boundingRect(box)
    crnn_infer_img = pick_img[y:y + h, x:x + w, :]
    crnn_infer_img = cv2.cvtColor(crnn_infer_img, cv2.COLOR_BGR2GRAY)
    # CRNN recognition
    raw_pred, sim_pred = crnn.predict(crnn_infer_img)
    print('%-20s => %-20s' % (raw_pred, sim_pred))
    if args.output_folder:
        cv2.putText(img_rec, sim_pred, (x, y + 20), cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255), 1)
if args.output_folder:
    img_det = img0[:, :, ::-1]
    imgc = img_det.copy()
    cv2.drawContours(imgc, contours, -1, (22, 222, 22), 1, cv2.LINE_AA)
    cv2.imwrite(args.output_folder + '/contour.png', imgc)
    img_draw = draw_bbox(img_rec, boxes_list)
    cv2.imwrite(args.output_folder + '/predict.jpg', img_draw)

4.3 Results

predict.png

From the results, DBNet predicts the text regions accurately (their size can also be tuned by adjusting DBNet's threshold), and CRNN recognizes the English letters inside them well.

5. Summary

This article implemented a text detection + character recognition OCR pipeline on X3, while revisiting some previously studied algorithms. Hopefully X3 will extend to more fields and create greater value.