AWS Textract - UnsupportedDocumentException

Question

I am implementing AWS Textract using boto3 for Python.

Code:

import boto3

# Document
documentName = "/home/niranjan/IdeaProjects/amazon-forecast-samples/notebooks/basic/Tutorial/cert.pdf"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

print(type(imageBytes))

# Amazon Textract client
textract = boto3.client('textract', region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

Below are my AWS credentials and config files:

niranjan@niranjan:~$ cat ~/.aws/credentials
[default]
aws_access_key_id=my_access_key_id
aws_secret_access_key=my_secret_access_key

niranjan@niranjan:~$ cat ~/.aws/config 
[default]
region=eu-west-1

I am getting this exception:

---------------------------------------------------------------------------
UnsupportedDocumentException              Traceback (most recent call last)
<ipython-input-11-f52c10e3f3db> in <module>
     14 
     15 # Call Amazon Textract
---> 16 response = textract.detect_document_text(Document={'Bytes': imageBytes})
     17 
     18 #print(response)

~/venv/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/venv/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

I am a bit new to AWS Textract; any help would be much appreciated.

Answer 1

The DetectDocumentText API of Textract does not support PDF documents, so sending a PDF raises UnsupportedDocumentException. Try sending an image file instead.
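To catch this before calling the API, you could inspect the first few bytes of the file. The following is a minimal sketch, not part of the Textract SDK; the helper names `sniff_document_format` and `is_supported_for_sync_call` are hypothetical, and the check only covers the JPEG, PNG, and PDF signatures.

```python
def sniff_document_format(data):
    """Guess the file type from its leading magic bytes.

    Hypothetical helper: returns 'jpeg', 'png', 'pdf', or 'unknown'.
    Works with both bytes and bytearray input.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"%PDF-"):
        return "pdf"
    return "unknown"


def is_supported_for_sync_call(data):
    # Assumption taken from the documentation quoted in this answer:
    # the synchronous call accepts JPEG or PNG input.
    return sniff_document_format(data) in ("jpeg", "png")


# Usage with the bytes read in the question (not executed here):
# with open(documentName, 'rb') as document:
#     imageBytes = bytearray(document.read())
# if not is_supported_for_sync_call(imageBytes):
#     print("Convert to JPEG/PNG or use the asynchronous API")
```

With the `cert.pdf` file from the question, the check would report `pdf` and you would know to take the asynchronous route instead.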

If you still want to send a PDF file, you have to use the asynchronous APIs of Textract, e.g. StartDocumentAnalysis to start the analysis and GetDocumentAnalysis to get the analyzed document.
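A rough sketch of that asynchronous flow is below. It assumes the PDF is already in an S3 bucket, and that `client` behaves like `boto3.client('textract')`. The function name `analyze_pdf_in_s3`, the polling parameters, and the chosen `FeatureTypes` are illustrative choices, not something prescribed by AWS.

```python
import time


def analyze_pdf_in_s3(client, bucket, key, poll_seconds=5, max_polls=60):
    """Start an analysis job, wait for it, and return all Block dicts.

    `client` is expected to behave like boto3.client('textract').
    """
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # illustrative choice
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS or we give up.
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("Job {} still in progress".format(job_id))

    if status != "SUCCEEDED":
        raise RuntimeError("Job {} ended with status {}".format(job_id, status))

    # Results may be paginated; follow NextToken until it disappears.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks


# Usage (not executed here; bucket and key are placeholders):
# import boto3
# blocks = analyze_pdf_in_s3(boto3.client('textract'), "my-bucket", "cert.pdf")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and polling is only reasonable for experiments; a notification-based setup is the usual choice for production.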

Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format. DetectDocumentText returns the detected text in an array of Block objects.

https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
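To make the "array of Block objects" part concrete, here is a small sketch that pulls the line-level text out of a response. The `sample_response` dict is hand-written for illustration and is much smaller than a real response; `extract_lines` is a hypothetical helper name.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract-style response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a real response (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Hello world"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Hello"},
        {"BlockType": "WORD", "Id": "w2", "Text": "world"},
    ]
}

print(extract_lines(sample_response))  # ['Hello world']
```

Filtering on `BlockType` avoids printing each word twice, since the words of a line also appear as their own WORD blocks.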

Answer 2

import boto3
import time

def startJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': objectName
        }
    })

    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification 
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

    return status

def getJobResults(jobId):

    pages = []

    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']

    while(nextToken):

        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']

    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
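# isJobComplete returns the final status string, which is always truthy,
# so compare it explicitly and fall back to an empty result otherwise.
response = []
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)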

#print(response)

# Print detected text
for resultPage in response:
    for item in resultPage["Blocks"]:
        if item["BlockType"] == "LINE":
            print ('\033[94m' +  item["Text"] + '\033[0m')

Try this code and refer to this link from AWS for an explanation.
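The code above reads the document from S3, while the question reads a local file. One way to bridge that, sketched under the assumption that `s3_client` behaves like `boto3.client('s3')` and that the bucket already exists, is to upload the PDF first. The helper name `upload_for_textract` is hypothetical.

```python
import os


def upload_for_textract(s3_client, local_path, bucket, key=None):
    """Upload a local file to S3 and return the object key that was used.

    `s3_client` is expected to behave like boto3.client('s3').
    If no key is given, the file's base name is used.
    """
    key = key or os.path.basename(local_path)
    s3_client.upload_file(local_path, bucket, key)
    return key


# Usage with the functions from this answer (not executed here):
# import boto3
# objectName = upload_for_textract(boto3.client('s3'), "cert.pdf", s3BucketName)
# jobId = startJob(s3BucketName, objectName)
```

The returned key can be passed straight to `startJob` as `objectName`.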

Answer 1
1 accepted 2020-04-22 06:01:47

Answer 2
0 2020-08-02 05:45:27
