AWS Textract - UnsupportedDocumentException - PDF

Question

I'm using boto3 (the AWS SDK for Python) to analyze a PDF document and get its form key:value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

Answer 1

AnalyzeDocument is a synchronous API which, at the time of writing, only supports PNG or JPG images.

Since you want to work with PDF files, you'll need to use the Amazon Textract asynchronous API, e.g. StartDocumentAnalysis or StartDocumentTextDetection.
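To illustrate the asynchronous flow with the second operation mentioned above, here is a minimal sketch using StartDocumentTextDetection. The Start call only returns a JobId; the text comes back from the matching Get call once the job has finished, possibly split over several pages linked by NextToken. The helper name detect_lines, the poll_seconds parameter, and the decision to raise on anything other than SUCCEEDED are my own assumptions, not part of the Textract API. The client argument is expected to be boto3.client('textract').

```python
import time


def detect_lines(client, bucket, key, poll_seconds=5):
    """Run StartDocumentTextDetection on an S3 object and return its LINE texts.

    Hypothetical helper: the name and the polling interval are illustrative.
    """
    job = client.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}}
    )
    job_id = job['JobId']

    # Poll until the job is no longer IN_PROGRESS
    page = client.get_document_text_detection(JobId=job_id)
    while page['JobStatus'] == 'IN_PROGRESS':
        time.sleep(poll_seconds)
        page = client.get_document_text_detection(JobId=job_id)

    # Assumption: treat FAILED and PARTIAL_SUCCESS alike and give up
    if page['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    # Results may span several response pages; follow NextToken until it is absent
    lines = []
    while True:
        lines += [b['Text'] for b in page.get('Blocks', []) if b['BlockType'] == 'LINE']
        token = page.get('NextToken')
        if not token:
            return lines
        page = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

Usage would look like detect_lines(boto3.client('textract'), 'YOUR_BUCKET_NAME', 'YOUR_FILE_KEY_NAME'). For long documents, polling in a loop like this is the simplest option but not the only one; Textract can also publish a completion notification to an SNS topic.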

Answer 2

As the docs say:

StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.

Boto3 Example

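import time

import boto3

client = boto3.client('textract')

# Start the asynchronous job; this call returns immediately with a JobId
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)
job_id = response['JobId']

# Get results from asynchronous operation.
# The job is usually still IN_PROGRESS right after it starts, so poll until
# the status changes (the 5-second interval is an arbitrary choice).
result = client.get_document_analysis(JobId=job_id)
while result['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)
    result = client.get_document_analysis(JobId=job_id)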
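Since the original question was about form key:value pairs, here is a rough sketch of how they could be pulled out of the Blocks list that GetDocumentAnalysis returns once a job has finished. With FeatureTypes=["FORMS"], the response contains KEY_VALUE_SET blocks: a KEY block points to its VALUE block through a VALUE relationship, and both point to their WORD blocks through CHILD relationships. The function names block_text and extract_key_values are hypothetical, and the sketch makes simplifying assumptions: it ignores SELECTION_ELEMENT children (checkboxes) and lets a repeated key overwrite the earlier one.

```python
def block_text(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = blocks_by_id[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)


def extract_key_values(blocks):
    """Map each form key's text to its value's text (hypothetical helper)."""
    blocks_by_id = {b['Id']: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block['BlockType'] == 'KEY_VALUE_SET'
                  and 'KEY' in block.get('EntityTypes', []))
        if not is_key:
            continue
        key = block_text(block, blocks_by_id)
        values = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                for value_id in rel['Ids']:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        # Assumption: a repeated key simply overwrites the earlier entry
        pairs[key] = ' '.join(values)
    return pairs
```

With the result from the example above, this would be called as extract_key_values(result['Blocks']). If the response carries a NextToken, the blocks from the following pages need to be collected first, because a key and its words can end up in different pages of the response.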

Answer 1: score 3, accepted, 2020-03-03 14:15:14

Answer 2: score 0, 2021-11-04 22:42:00
