
How to remove bad or special characters from OCR output in OpenCV/Python and improve OCR accuracy?

I have built a Python program that extracts text from an image using OCR. It works, but when I run the code I get some bad characters and the accuracy is not good. Can I add a dataset of the characters that should be recognized? How can I solve these problems?

This is my image:

Sample image

And this is the code:

import cv2
import numpy as np
import pytesseract

# Read input image, convert to grayscale
img = cv2.imread('9.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Remove shadows, cf. https://stackoverflow.com/a/44752405/11089932
dilated_img = cv2.dilate(gray, np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)
diff_img = 255 - cv2.absdiff(gray, bg_img)
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255,
                         norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)

# Threshold using Otsu's
work_img = cv2.threshold(norm_img, 0, 255, cv2.THRESH_OTSU)[1]

# Tesseract
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(work_img, config=custom_config)
print(text)

And finally this is the output:

fe
|Urine Analysis
| Urine analysis
| Color Yellow RBC/hpf 4-6
| Appereance Turbid WBC/hpf 2-3
; Specific Gravity 1014 Epithelial cells/Lpf 1-2
PH 7 Bacteria (Few)
| Protein Pos(+) Casts Pos(+)
Glucose Negative Mucous (Few)
Keton. Negative
Blood Pos(+)
Bilirubin Negative
' Urobilinogen Negative
| Nitrite Pos(+)

I had a similar problem. I was trying to extract some information from an image, but I was getting other raw text as well. What you can do is post-process the OCR output with code that keeps only the data you want.

Here is my input image, which is similar to yours: Input image

The code below extracts only the students' IDs (registration numbers).

# `new` is assumed to hold the OCR output as a list of text lines
# (it is not defined in the original snippet)
regs_no = []
Status = []
# Extract only the registration numbers
for line in new:
    if line[1:6] == "8MDSW":
        regs_no.append(line)
        Status.append('P')

So the above code extracts only the registration numbers.

In your case you can use similar code to keep only the desired text. Hope it works. Thanks.
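
For the output shown in the question, the same filtering idea could be sketched roughly as follows. This is only an illustration, not the method from the answer: the function name `clean_ocr_text`, the set of characters treated as junk, and the `min_len` cutoff are all assumptions chosen to match the stray `|`, `;` and `'` marks and the short `fe` line in that output.

```python
import re

# Stray marks seen at the start of lines in the question's output.
# Assumption: they are table-border artifacts, not real content.
LEADING_JUNK = re.compile(r"^[\s|;'`.,:_\-]+")

def clean_ocr_text(text, min_len=3):
    """Strip leading border artifacts and drop very short noise lines."""
    cleaned = []
    for line in text.splitlines():
        line = LEADING_JUNK.sub("", line).strip()
        # Lines shorter than min_len (e.g. "fe") are treated as noise.
        if len(line) >= min_len:
            cleaned.append(line)
    return cleaned

# A few lines copied from the OCR output in the question
raw = """fe
|Urine Analysis
| Color Yellow RBC/hpf 4-6
; Specific Gravity 1014 Epithelial cells/Lpf 1-2
PH 7 Bacteria (Few)
' Urobilinogen Negative"""

for line in clean_ocr_text(raw):
    print(line)
```

This only cleans the text after recognition. It does not make the OCR itself more accurate. Restricting the characters Tesseract may output (its `tessedit_char_whitelist` config variable) is a separate option, and whether it takes effect depends on the Tesseract version and engine mode.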
