How to get the output in a CSV file? (Python and OCR)
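This code searches a scanned document for specific keywords and outputs the words that follow them, but only to the console. My problem is that I now want this output in CSV format. Can someone help me understand how to get the output as a CSV file?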
from pdf2image import convert_from_path
import os
import pytesseract

# The attribute that sets the Tesseract binary is tesseract_cmd
pytesseract.pytesseract.tesseract_cmd = "D:\\Users\\Dekt\\tesseract.exe"
Data = 'D:\\Users\\files\\example.pdf'
doc = convert_from_path(Data)
path, fileName = os.path.split(Data)
fileBaseName, fileExtension = os.path.splitext(fileName)
for page_number, page_data in enumerate(doc):
    txt = pytesseract.image_to_string(page_data, lang='deu').encode('utf-8')
    txt = txt.decode('utf-8')
    tokens = txt.split()
    if "Name" in tokens:
        location = tokens.index('Name')
        print("Name: " + (tokens[location + 1]) + " " + (tokens[location + 2]) + " " + (
            tokens[location + 3]))
Instead of using print(), open a file in write mode and write the content to that file.
from pdf2image import convert_from_path
import os
import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = "D:\\Users\\Dekt\\tesseract.exe"
filePath = 'D:\\Users\\files\\example.pdf'
doc = convert_from_path(filePath)
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)
# Open the output file once; the with block closes it when the loop is done.
with open("myCSV.csv", "w") as output:
    for page_number, page_data in enumerate(doc):
        txt = pytesseract.image_to_string(page_data, lang='deu').encode('utf-8')
        txt = txt.decode('utf-8')
        tokens = txt.split()
        # For each keyword found on the page, write the three tokens after it.
        if "Name" in tokens:
            location = tokens.index('Name')
            output.write("Name: " + (tokens[location + 1]) + " " + (tokens[location + 2]) + " " + (
                tokens[location + 3]) + ",")
        if "Date" in tokens:
            location = tokens.index('Date')
            output.write("Date is : "+(tokens[location+1])+" "+(tokens[location+2])+" "+(tokens[location+3]) + ",")
        if "Adress" in tokens:
            location = tokens.index('Adress')
            output.write("Adress is : "+(tokens[location+1])+" "+(tokens[location+2])+" "+(tokens[location+3]) + ",")
I added a comma at the end of each statement because I don't know exactly what format you are looking for.
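If what you want is one row per page with one column per keyword, Python's standard csv module can take care of quoting and line endings for you. The following is only a rough sketch under that assumption; the names KEYWORDS, extract_fields and write_csv are made up for illustration and are not part of pytesseract or pdf2image, and the function works on plain text so the OCR step stays as it is above.

```python
import csv

# Hypothetical keyword list, spelled exactly as in the code above.
KEYWORDS = ["Name", "Date", "Adress"]


def extract_fields(text, keywords=KEYWORDS, n_words=3):
    """Return {keyword: the n_words tokens following its first occurrence}."""
    tokens = text.split()
    row = {}
    for keyword in keywords:
        if keyword in tokens:
            start = tokens.index(keyword) + 1
            # Slicing does not raise IndexError when the keyword sits
            # near the end of the page; it just returns fewer tokens.
            row[keyword] = " ".join(tokens[start:start + n_words])
        else:
            # Keyword not found on this page: leave the column empty.
            row[keyword] = ""
    return row


def write_csv(path, page_texts, keywords=KEYWORDS):
    """Write a header line, then one CSV row per page text."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=keywords)
        writer.writeheader()
        for text in page_texts:
            writer.writerow(extract_fields(text, keywords))
```

To use it with the OCR loop, you could collect the result of pytesseract.image_to_string(page_data, lang='deu') for each page into a list and pass that list to write_csv("myCSV.csv", texts). The labels such as "Name: " then live in the header row instead of being repeated in every cell.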