
How to get coordinates of the overall bounding box of a text image?

original image

import cv2
import pytesseract
from pytesseract import Output
import matplotlib.pyplot as plt

img = cv2.imread('eng2.png')

d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['level'])
for i in range(n_boxes):
    (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

plt.figure(figsize=(10,10))
plt.imshow(img)

The above code produces this image. The image now contains two kinds of boxes: one for each word and one for the whole text. I would like to get the coordinates for the whole text (the sentence on each line, or the whole paragraph).

This is what I have tried

import numpy as np
import pandas as pd

box = pd.DataFrame(d)                          # dict to DataFrame
box['text'] = box['text'].replace('', np.nan)  # replace empty strings with NaN
box = box.dropna(subset=['text'])              # drop rows whose text is NaN

print(box)


def lineup(boxes):
    linebox = None
    for _, box in boxes.iterrows():
        if linebox is None: linebox = box           # first line begins
        elif box.top <= linebox.top+linebox.height: # box in same line
            linebox.top = min(linebox.top, box.top)
            linebox.width = box.left+box.width-linebox.left
            linebox.height = max(linebox.top+linebox.height, box.top+box.height)-linebox.top
            linebox.text += ' '+box.text
        else:                                       # box in new line
            yield linebox
            linebox = box                           # new line begins
    yield linebox                                   # return last line

lineboxes = pd.DataFrame.from_records(lineup(box))

Output dataframe

n_boxes = len(lineboxes['level'])
for i in range(n_boxes):
    (x, y, w, h) = (lineboxes['left'][i], lineboxes['top'][i], lineboxes['width'][i], lineboxes['height'][i])
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

plt.figure(figsize=(10,10))
plt.imshow(img)

There seems to be no difference between the original coordinates and the ones obtained after joining the boxes.

How can I get the coordinates of the whole text (the sentence on each line, or the whole paragraph) using the pytesseract library?
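For reference, every row returned by image_to_data carries a level value (1 = page, 2 = block, 3 = paragraph, 4 = line, 5 = word), which is why the plot shows both word boxes and larger boxes. The following is a minimal sketch of selecting only the line-level rows; the helper name boxes_at_level and the sample dictionary are made up for illustration, and the sample stands in for the real output of image_to_data.

```python
# Level codes in Tesseract's TSV output: 1=page, 2=block, 3=paragraph, 4=line, 5=word
LINE_LEVEL = 4

def boxes_at_level(data, level):
    """Return (left, top, width, height) for every row of `data` at the given level."""
    return [
        (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        for i, lvl in enumerate(data['level'])
        if lvl == level
    ]

# Hand-made stand-in for pytesseract.image_to_data(img, output_type=Output.DICT)
sample = {
    'level':  [1, 2, 3, 4, 5, 5, 4, 5],
    'left':   [0, 10, 10, 10, 10, 60, 10, 10],
    'top':    [0, 10, 10, 10, 10, 12, 40, 40],
    'width':  [200, 120, 120, 100, 40, 50, 120, 120],
    'height': [100, 50, 50, 20, 20, 18, 20, 20],
}

line_boxes = boxes_at_level(sample, LINE_LEVEL)
print(line_boxes)
```

Passing 3 or 2 instead of LINE_LEVEL would select paragraph or block boxes in the same way. With a real image, the dictionary d from the code above would take the place of sample.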

You faced a similar issue in one of your previous questions, linked here. I failed to elaborate on what I meant in the comments, so here is a more visual explanation.

By horizontal kernel I meant an array with a single row, such as [1, 1, 1, 1, 1]. The number of columns can be chosen based on the font size and the space between characters/words. Using this kernel with a morphological dilation operation, you can connect entities that lie next to each other horizontally into a single entity.
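To make the effect concrete, here is a small pure-Python sketch of horizontal dilation on a single row of pixels. It is only an illustration and not OpenCV's implementation; the helper names dilate_row and count_runs are invented for this example, and an odd kernel length centred on each pixel is assumed.

```python
def dilate_row(row, kernel_length):
    """Dilate a 0/1 row with a 1 x kernel_length kernel of ones, centred on each pixel."""
    half = kernel_length // 2
    n = len(row)
    return [
        1 if any(row[max(0, i - half):min(n, i + half + 1)]) else 0
        for i in range(n)
    ]

def count_runs(row):
    """Count separate runs of 1s, i.e. the number of connected entities in the row."""
    return sum(1 for i, v in enumerate(row) if v and (i == 0 or not row[i - 1]))

# Three "characters": two close together (gap of 2) and one far away (gap of 8)
row = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
dilated = dilate_row(row, 5)

print(count_runs(row), count_runs(dilated))
```

The two nearby runs merge because the kernel is wider than the gap between them, while the distant run stays separate. Choosing the kernel length is therefore a trade-off: wide enough to bridge the spaces between words, but no wider than necessary.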

In your case, we would like to extract each line as an individual entity. Let's go through the code:

Code:

import cv2
import numpy as np

img = cv2.imread('letter.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# inverse binary image, to ensure text region is in white
# because contours are found for objects in white
th = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

There is a black border surrounding the original image, which becomes a white border in th. Since it is unwanted, we remove it using cv2.floodFill():

black = np.zeros([img.shape[0] + 2, img.shape[1] + 2], np.uint8)
mask = cv2.floodFill(th.copy(), black, (0,0), 0, 0, 0, flags=8)[1]

# dilation using horizontal kernel
kernel_length = 30
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
dilate = cv2.dilate(mask, horizontal_kernel, iterations=1)

img2 = img.copy()
contours = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
for c in contours:
  x, y, w, h = cv2.boundingRect(c)
  cv2.rectangle(img2, (x, y), (x + w, y + h), (0,255,0), 2)  # draw on the copy, not the original

[image: a bounding box drawn around each line]

You can get the coordinates of each line from cv2.boundingRect(). This can be seen in the image above. Using those coordinates, you can crop each line of the document and feed it to the pytesseract library.
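The cropping step could look roughly like the sketch below. The helper names (sort_top_to_bottom, pad_box, ocr_lines) and the 2-pixel padding are assumptions made for illustration. The boxes are sorted explicitly because the order of the contours returned by cv2.findContours should not be relied on to match reading order.

```python
def sort_top_to_bottom(boxes):
    """Order (x, y, w, h) boxes by their top edge, then by their left edge."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def pad_box(box, pad, img_w, img_h):
    """Grow a box by `pad` pixels on every side, clipped to the image bounds."""
    x, y, w, h = box
    x0, y0 = max(0, x - pad), max(0, y - pad)
    x1, y1 = min(img_w, x + w + pad), min(img_h, y + h + pad)
    return x0, y0, x1 - x0, y1 - y0

def ocr_lines(img, boxes, pad=2):
    """Crop each box from a NumPy image and run Tesseract on it (needs pytesseract)."""
    import pytesseract  # imported lazily so the helpers above work without it
    img_h, img_w = img.shape[:2]
    texts = []
    for box in sort_top_to_bottom(boxes):
        x, y, w, h = pad_box(box, pad, img_w, img_h)
        crop = img[y:y + h, x:x + w]
        # --psm 7 asks Tesseract to treat the crop as a single text line
        texts.append(pytesseract.image_to_string(crop, config='--psm 7').strip())
    return texts
```

With the code above, boxes would be the list of cv2.boundingRect(c) results, for example ocr_lines(img, [cv2.boundingRect(c) for c in contours]). A little padding around each crop tends to help recognition, though the best value depends on the image.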
