AWS Textract - UnsupportedDocumentException - PDF

Question

I am using boto3 (the AWS SDK for Python) to analyze a document (a PDF) and get its form key: value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for analyzing documents, but when I run my function I get this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

Answer 1

AnalyzeDocument is a synchronous API and only supports PNG or JPG images.

Since you want to process a PDF file, you need to use one of the Amazon Textract asynchronous APIs, such as StartDocumentAnalysis or StartDocumentTextDetection.
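The asynchronous flow has three steps: start a job, wait for it to finish, then read the results page by page. The following is a minimal sketch of that flow, not a definitive implementation. The helper names wait_for_analysis and analyze_pdf, the polling interval, and the timeout are assumptions made for illustration; start_document_analysis, get_document_analysis, JobStatus and NextToken are the boto3 Textract names.

```python
import time


def wait_for_analysis(client, job_id, poll_seconds=5, timeout_seconds=600,
                      sleep=time.sleep):
    """Poll get_document_analysis until the job finishes, then return all Blocks.

    Hypothetical helper: `client` is a boto3 Textract client (or any object
    with a compatible get_document_analysis method).
    """
    waited = 0
    while True:
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status != "IN_PROGRESS":
            break
        if waited >= timeout_seconds:
            raise TimeoutError(f"Textract job {job_id} still running after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    blocks = list(page.get("Blocks", []))
    # Large results are paginated; follow NextToken until it is absent.
    while "NextToken" in page:
        page = client.get_document_analysis(JobId=job_id,
                                            NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks


def analyze_pdf(bucket, key):
    """Start a FORMS analysis job for an S3 object and return its Blocks."""
    import boto3  # imported here so the polling helper has no hard dependency

    client = boto3.client("textract")
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    return wait_for_analysis(client, job["JobId"])
```

With the bucket and file from the question, this would be called as analyze_pdf('francismorgan-01', '709 Privado M SURESTE.pdf'). Polling is the simplest option for a script; for production workloads Textract can also publish a completion notification to an SNS topic, which avoids the polling loop.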

Answer 2

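As the documentation says:

StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.

Boto3 example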

import boto3

client = boto3.client('textract')

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

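# Get results from the asynchronous operation. The job may still be running:
# check result['JobStatus'] and call again until it is no longer 'IN_PROGRESS'.
result = client.get_document_analysis(JobId=response['JobId'])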
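Once the job has succeeded, the form data the question asks for lives in the Blocks list of the result. A rough sketch of turning those blocks into a key: value dictionary follows. The function name extract_form_fields is an assumption for illustration; the block fields it reads (BlockType, EntityTypes, Relationships, Text, SelectionStatus) are the ones Textract returns for FORMS analysis.

```python
def extract_form_fields(blocks):
    """Map each form key's text to its value's text, given Textract Blocks.

    Hypothetical helper. If the same key text appears twice in a document,
    the later occurrence overwrites the earlier one.
    """
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, relation_type):
        # Collect the Ids of all relationships of the given type.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == relation_type:
                ids.extend(rel["Ids"])
        return ids

    def text_of(block):
        # A KEY or VALUE block has no text of its own; join its child words.
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id.get(child_id)
            if child is None:
                continue
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED instead of text.
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    fields = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if is_key:
            values = [by_id[i] for i in related_ids(block, "VALUE") if i in by_id]
            fields[text_of(block)] = " ".join(text_of(v) for v in values)
    return fields
```

For a single-page response this could be called as extract_form_fields(result['Blocks']); for paginated results, the blocks from every page should be concatenated first so that relationships pointing across pages still resolve.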

2 answers

Answer 1
Score 3, accepted, 2020-03-03 14:15:14

Answer 2
Score 0, 2021-11-04 22:42:00
