How to analyse PDF documents with Amazon Textract in a synchronous way?
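I want to extract tables from a bunch of PDFs I have. For this, I am using an AWS Textract Python pipeline.
Please advise how I can do this without SNS and SQS. I want it to be synchronous: give my pipeline a PDF file, call AWS Textract, and get the results.
Here is what I am using in the meantime; please advise what I should change: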
import boto3
import time

def startJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def getJobResults(jobId):
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
    while(nextToken):
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']
    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(isJobComplete(jobId)):
    response = getJobResults(jobId)

#print(response)

# Print detected text
for resultPage in response:
    for item in resultPage["Blocks"]:
        if item["BlockType"] == "LINE":
            print('\033[94m' + item["Text"] + '\033[0m')
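It is currently not possible to process PDF documents directly with Textract synchronously. From the Textract documentation:
Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.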
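One workaround is to convert the PDF documents to images in your code and then use the synchronous API operations on those images to process the documents.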
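A minimal sketch of that workaround, assuming the third-party pdf2image package (which needs Poppler installed) is used to render the pages; that library choice and the helper names pdf_to_png_bytes, analyze_pages_sync and count_tables are illustrative assumptions, not something from the question. Each page is rendered to PNG bytes and passed to the synchronous AnalyzeDocument call with the TABLES feature, since the goal is table extraction:

```python
import io


def pdf_to_png_bytes(pdf_path, dpi=200):
    """Render every page of a PDF to PNG bytes.

    Assumption: pdf2image (and Poppler) is installed. It is imported
    lazily so the rest of this sketch works without it.
    """
    from pdf2image import convert_from_path

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        pages.append(buffer.getvalue())
    return pages


def analyze_pages_sync(page_images, client):
    """Call the synchronous AnalyzeDocument API once per page image.

    `client` is expected to be a boto3 Textract client, for example
    boto3.client("textract"). Returns one response dict per page.
    """
    responses = []
    for image_bytes in page_images:
        response = client.analyze_document(
            Document={"Bytes": image_bytes},
            FeatureTypes=["TABLES"],
        )
        responses.append(response)
    return responses


def count_tables(responses):
    """Count the TABLE blocks across all page responses."""
    return sum(
        1
        for response in responses
        for block in response["Blocks"]
        if block["BlockType"] == "TABLE"
    )
```

With AWS credentials configured, this could be called roughly as analyze_pages_sync(pdf_to_png_bytes("document.pdf"), boto3.client("textract")), where "document.pdf" is a placeholder path. No SNS topic, SQS queue or S3 upload is involved, at the cost of one API call per page.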