How can I use Amazon Textract to analyze PDF documents synchronously?

Question

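I want to extract tables from a batch of PDFs that I have. To do this, I am using a Python pipeline built on AWS Textract.

How can I do this without SNS and SQS? I want it to be synchronous: give my pipeline a PDF file, call AWS Textract, and get the results.

Here is what I am using in the meantime. Please let me know what I should change: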

import boto3
import time

def startJob(s3BucketName, objectName):
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })

    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification 
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

    return status

def getJobResults(jobId):
    # Collect every page of the paginated result set
    pages = []

    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))

    # Keep requesting pages for as long as the service returns a NextToken
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')

    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))

# isJobComplete returns the final status string, which is always truthy,
# so compare it explicitly instead of testing it directly
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)

    #print(response)

    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')

Answer 1

At the time of this answer, Textract cannot process PDF documents synchronously. From the Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.
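
Because the asynchronous operations accept PDFs, a pipeline can still behave synchronously from the caller's point of view by starting a job and blocking until it finishes, with no SNS or SQS involved. The following is a minimal sketch rather than an official pattern: the helper name `analyze_pdf_blocking`, the polling interval, and the timeout are assumptions, and `client` is expected to be a `boto3.client("textract")`. It uses StartDocumentAnalysis with the `TABLES` feature, since the goal in the question is table extraction.

```python
import time


def analyze_pdf_blocking(client, bucket, key, poll_seconds=5,
                         timeout_seconds=600, sleep=time.sleep):
    """Start an asynchronous table-analysis job and block until it finishes.

    `client` is assumed to be boto3.client("textract"). The polling interval
    and timeout defaults are illustrative choices, not service requirements.
    """
    job_id = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or the timeout expires.
    waited = 0
    response = client.get_document_analysis(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        if waited >= timeout_seconds:
            raise TimeoutError("Textract job {} did not finish in time".format(job_id))
        sleep(poll_seconds)
        waited += poll_seconds
        response = client.get_document_analysis(JobId=job_id)

    # This sketch treats anything other than SUCCEEDED (including
    # PARTIAL_SUCCESS) as an error.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job {} ended with status {}: {}".format(
            job_id, response["JobStatus"], response.get("StatusMessage", "")))

    # Results are paginated; follow NextToken until every block is collected.
    blocks = list(response.get("Blocks", []))
    while "NextToken" in response:
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"])
        blocks.extend(response.get("Blocks", []))
    return blocks
```

Calling `analyze_pdf_blocking(boto3.client("textract"), "my-bucket", "my.pdf")` (bucket and key are placeholders) returns one combined list of blocks for the whole document. The `sleep` parameter exists only so the wait can be replaced in tests.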

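One workaround is to convert the PDF document to images in your code, then process the document by calling the synchronous API operations on those images.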
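
The image-based workaround could look roughly like the sketch below. It assumes the third-party `pdf2image` package (which needs Poppler installed) for rendering pages, and a `boto3.client("textract")` passed in as `client`. The helper names `pdf_to_png_bytes`, `analyze_pages`, and `table_rows` are hypothetical, and the 200 dpi default is an arbitrary choice. `table_rows` shows one way to rebuild tables from the returned TABLE, CELL, and WORD blocks.

```python
import io


def pdf_to_png_bytes(pdf_path, dpi=200):
    """Render each PDF page to PNG bytes."""
    # Assumption: pdf2image and Poppler are installed; imported lazily so the
    # other helpers can be used without them.
    from pdf2image import convert_from_path

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        pages.append(buffer.getvalue())
    return pages


def analyze_pages(client, page_images):
    """Call the synchronous AnalyzeDocument operation once per page image."""
    results = []
    for image_bytes in page_images:
        response = client.analyze_document(
            Document={"Bytes": image_bytes},
            FeatureTypes=["TABLES"],
        )
        results.append(response["Blocks"])
    return results


def table_rows(blocks):
    """Rebuild every TABLE block as a list of rows of cell text."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the ids referenced by CHILD relationships, if any.
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [by_id[w]["Text"] for w in child_ids(cell)
                     if by_id[w]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        # RowIndex and ColumnIndex are 1-based in Textract responses.
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables
```

A pipeline would chain these per document: `for blocks in analyze_pages(client, pdf_to_png_bytes("my.pdf")): print(table_rows(blocks))`, where the file name is a placeholder. Each page is a separate request, so a table that spans pages comes back as separate tables.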

Solution 1
2, accepted, 2020-06-03 13:43:06