
Python OpenCV: remove noise from a CAPTCHA

I need to solve CAPTCHAs automatically in order to collect public data from websites.

I use Python and OpenCV, and I'm new to image processing. After some searching, I came up with the following approach. Since the text in the CAPTCHA uses a group of related colours, I convert the image to HSV and apply a colour mask, then convert the result to grayscale and apply an adaptive threshold (ADAPTIVE_THRESH_MEAN_C) to remove noise.

But this does not remove enough noise for automatic text recognition with OCR (Tesseract). See the images below.

Is there something I can improve in my solution, or is there a better approach?

Original images:

[images: captcha1, captcha2]

Processed images:

[images: captcha1, captcha2]

Code:

import cv2
import numpy as np

img = cv2.imread("1.jpeg")
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Keep only pixels whose hue falls in the text colour's range (OpenCV hue is 0-179)
mask = cv2.inRange(hsv, (36, 0, 0), (70, 255, 255))  # green
# mask = cv2.inRange(hsv, (0, 0, 0), (10, 255, 255))
# mask = cv2.inRange(hsv, (125, 0, 0), (135, 255, 255))

# Apply the mask, then turn the blacked-out background white
img = cv2.bitwise_and(img, img, mask=mask)
img[(img == 0).all(axis=2)] = 255

img = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 15, 2)

cv2.imwrite("out.png", img)

I think you can get good results by smoothing the image first and then detecting its edges. Here is the code:

import cv2

img = cv2.imread("input.jpg")
# smooth the image with a 5x5 median filter
img = cv2.medianBlur(img, 5)

# edge detection
edges = cv2.Canny(img, 100, 200)
cv2.imwrite("output.png", edges)


You can try different approaches to achieve your goal. Your first image can be processed by applying a median filter (r=2), followed by adaptive thresholding: [image: processed image with clear text]

Binary opening is another option you could try: [image: processed image with clear text; the text quality is lower]

Note that the quality is lower than with the first approach (in particular, the last G is visibly degraded).

The second image responds differently to these treatments than the first one:

For the median approach:

[image]

For opening:

[image: second image]

However, it is possible to extract the text via the application of a median blur (r=1), followed by auto-contrast and then thresholding with 50:

[image: second image with reduced noise]

As you can see, it is possible to improve the quality of your images enough for the text to be recognizable. The first image can be converted to text without problems, but the second one can only be recognized partially.

