
How to tabulate text data extracted from an image?

I've used OpenCV and pytesseract to extract text from images, but I'm looking for a way to tabulate the extracted text into a TXT or CSV file. Currently, the output from Python is jumbled together in paragraph form.

The input image:

[Image: input image]

The code I've used so far:

[Image: current code]

This is the output I'm getting now:

[Image: current output]

The output I'm expecting would be:

[Image: expected output]

I assume you work on screenshots similar to the provided one, i.e. the content is essentially always the same, with the fields "Lokasi", "Nama", etc.

Crop the central white part of the image and run pytesseract on it. The output should then also essentially always have the same layout. You get a string containing double new lines \n\n, which you can replace with single new lines \n, and then split the string at the new lines. What's left is parsing the content of the individual lines and storing the values appropriately, e.g. in a simple dictionary.
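The clean-up step can be tried on a plain string, without an image. This is only a minimal sketch: `raw` is a made-up string standing in for what `pytesseract.image_to_string` might return, and the real OCR output will differ. Stripping whitespace and dropping empty entries are extra steps added here for illustration.

```python
# Hypothetical OCR output; stands in for pytesseract.image_to_string(img)
raw = "Check-in\n\nLokasi\nPERSATUAN PERJIRANAN PARKVIEW\n\nRisiko\nLow\n"

# Replace double new lines by single ones, then split at the new lines
lines = [line.strip() for line in raw.replace('\n\n', '\n').split('\n')]

# Drop empty leftovers, e.g. from a trailing new line (an addition in this sketch)
lines = [line for line in lines if line]

print(lines)
```

Each label ("Lokasi", "Risiko") now sits on its own line, directly followed by its value, which is what the parsing relies on.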

Here's some code:

import cv2
import numpy as np
import pytesseract

# Read image
img = cv2.imread('6cO7N.jpg', cv2.IMREAD_GRAYSCALE)

# Crop central white part from image
mask = (img == 255).astype(np.uint8) * 255
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.full((11, 11), 255))
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.full((21, 21), 255))
x, y, w, h = cv2.boundingRect(mask)
img = img[y:y+h, x:x+w]

# Extract text, replace double new lines, and split lines
lines = pytesseract.image_to_string(img).replace('\n\n', '\n').split('\n')


# Helper function to return index of line with given content
def get_idx(texts, target):
    return [idx for idx in range(len(texts)) if texts[idx] == target][0]


# Extract data from lines
idx_nama_no_telefon = get_idx(lines, 'Nama No. Telefon')
nama_no_telefon = lines[idx_nama_no_telefon + 1].split('+')
nama = nama_no_telefon[0][:-1]
idx_tarikh_masa = get_idx(lines, 'Tarikh Masa')
for i in range(idx_nama_no_telefon + 2, idx_tarikh_masa):
    nama = nama + ' ' + lines[i]
tarikh_masa = lines[idx_tarikh_masa + 1].split(' ')

# Store data in some structure - if needed
data = {'Check-in': lines[0],
        'Lokasi': lines[get_idx(lines, 'Lokasi') + 1],
        'Nama': nama,
        'No. Telefon': '+' + nama_no_telefon[1],
        'Tarikh': ' '.join(tarikh_masa[:3]),
        'Masa': ' '.join(tarikh_masa[3:]),
        'Risiko': lines[get_idx(lines, 'Risiko') + 1]}

# Print data as desired (skip the first entry, 'Check-in')
for k, v in list(data.items())[1:]:
    print('{} : {}'.format(k, v))

Since I didn't know how many lines the name can span, there's a loop that collects all parts of it.
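The name collection can also be isolated into a small function and tried on plain strings. This is a sketch under assumptions: `collect_name` is a hypothetical helper, not part of the code above, and the `sample` lines are made up to mimic a name that wraps onto a second line.

```python
def collect_name(lines, start_label='Nama No. Telefon', end_label='Tarikh Masa'):
    """Return (name, phone) from the lines between the two label lines."""
    start = lines.index(start_label)
    end = lines.index(end_label)
    # The first line after the label holds the start of the name and the phone number
    first, _, phone = lines[start + 1].partition('+')
    # Any further lines before the next label are treated as parts of the name
    parts = [first.strip()] + [line.strip() for line in lines[start + 2:end]]
    name = ' '.join(part for part in parts if part)
    return name, ('+' + phone.strip() if phone else '')


# Made-up lines, assuming the name wraps onto a second line
sample = ['Nama No. Telefon',
          'ABBIMANYU A/L +60127658504',
          'CHITHAMBARAM',
          'Tarikh Masa',
          'May 19, 2021 7:32:35 PM']

name, phone = collect_name(sample)
print(name)
print(phone)
```

Like the loop, this treats every line between the two labels as belonging to the name, so it works for one, two, or more name lines.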

The output right now would be:

Lokasi : PERSATUAN PERJIRANAN PARKVIEW
Nama : ABBIMANYU A/L CHITHAMBARAM
No. Telefon : +60127658504
Tarikh : May 19, 2021
Masa : 7:32:35 PM
Risiko : Low
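Since the question asks for a CSV file, one possible way to tabulate such dictionaries is Python's built-in `csv.DictWriter`. The following is a sketch, not part of the original code: `records` is a hypothetical list with one dictionary per processed image, filled here with the values printed above, and an in-memory buffer is used instead of a real file.

```python
import csv
import io

# Hypothetical list of results, one dictionary per processed image
records = [{'Check-in': 'Check-in',
            'Lokasi': 'PERSATUAN PERJIRANAN PARKVIEW',
            'Nama': 'ABBIMANYU A/L CHITHAMBARAM',
            'No. Telefon': '+60127658504',
            'Tarikh': 'May 19, 2021',
            'Masa': '7:32:35 PM',
            'Risiko': 'Low'}]

# Columns to write; keys not listed here (e.g. 'Check-in') are ignored
fieldnames = ['Lokasi', 'Nama', 'No. Telefon', 'Tarikh', 'Masa', 'Risiko']

# In-memory buffer; for a real file use open('output.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction='ignore')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The date contains a comma, so the writer puts that field in quotes; spreadsheet programs and `csv.DictReader` read it back as a single value.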

Attention: This is a fairly hard-coded solution that relies on the stated assumptions. Even minor changes in the input images may lead to wrong outputs.
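One way to soften the exact-match lookup a little is to accept labels that are only nearly identical, for example when OCR misreads a single character. The sketch below uses `difflib.get_close_matches` from the standard library. `find_label` is a hypothetical replacement for `get_idx`, the cutoff of 0.8 is an arbitrary assumption, and the `sample` lines are made up.

```python
import difflib


def find_label(lines, target, cutoff=0.8):
    """Return the index of the line most similar to target, or None."""
    cleaned = [line.strip() for line in lines]
    matches = difflib.get_close_matches(target, cleaned, n=1, cutoff=cutoff)
    return cleaned.index(matches[0]) if matches else None


# Made-up lines with an OCR-style misread ('Lokasl') and a trailing space
sample = ['Check-in', 'Lokasl', 'PERSATUAN PERJIRANAN PARKVIEW', 'Risiko ', 'Low']

print(find_label(sample, 'Lokasi'))
print(find_label(sample, 'Risiko'))
print(find_label(sample, 'Tarikh Masa'))
```

Returning `None` for a missing label lets the caller decide how to handle an unexpected layout instead of failing with an `IndexError`. It does not remove the dependence on the overall layout of the screenshot.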

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.19041-SP0
Python:        3.9.1
PyCharm:       2021.1.1
NumPy:         1.19.5
OpenCV:        4.5.2
pytesseract:   5.0.0-alpha.20201127
----------------------------------------


 