AWS Textract - UnsupportedDocumentException - PDF

Question

I'm using boto3 (the AWS SDK for Python) to analyze a PDF document and extract its form key:value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

Answer 1

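AnalyzeDocument is a synchronous API that, at the time of this answer, only supported PNG or JPEG images. Later versions of the service also accept single-page PDF and TIFF input in synchronous calls, but multipage PDFs still need the asynchronous operations.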

Since you want to work with PDF files, you'll need to use the Amazon Textract asynchronous API, e.g. StartDocumentAnalysis or StartDocumentTextDetection.

Answer 2

As the docs for StartDocumentAnalysis say:

StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.

Boto3 Example

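import time

import boto3

client = boto3.client('textract')

# Start the asynchronous job; this call returns immediately with a JobId
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

# The job takes a while, so poll until it is no longer in progress.
# The final JobStatus is SUCCEEDED, FAILED, or PARTIAL_SUCCESS.
result = client.get_document_analysis(JobId=response['JobId'])
while result['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)
    result = client.get_document_analysis(JobId=response['JobId'])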
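
Since the question asks for the form key:value pairs, here is a minimal sketch of how the returned blocks could be turned into a dictionary. It assumes the job has finished successfully. The helper names (get_all_blocks, extract_key_values, _text_of) are made up for illustration and are not part of boto3. The sketch relies on the documented block structure: KEY_VALUE_SET blocks whose EntityTypes contain KEY point to their VALUE blocks and to their WORD children through Relationships, and large results are split across calls via NextToken.

```python
def get_all_blocks(client, job_id):
    """Collect blocks from every page of results by following NextToken."""
    blocks = []
    kwargs = {'JobId': job_id}
    while True:
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get('Blocks', []))
        token = page.get('NextToken')
        if not token:
            return blocks
        kwargs['NextToken'] = token


def _text_of(block, by_id):
    """Join the WORD children of a block; a selected checkbox becomes 'X'."""
    parts = []
    for rel in block.get('Relationships', []):
        if rel['Type'] != 'CHILD':
            continue
        for child_id in rel['Ids']:
            child = by_id[child_id]
            if child['BlockType'] == 'WORD':
                parts.append(child['Text'])
            elif (child['BlockType'] == 'SELECTION_ELEMENT'
                  and child.get('SelectionStatus') == 'SELECTED'):
                parts.append('X')
    return ' '.join(parts)


def extract_key_values(blocks):
    """Map each form key's text to the text of its linked value block(s)."""
    by_id = {b['Id']: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block['BlockType'] == 'KEY_VALUE_SET'
                  and 'KEY' in block.get('EntityTypes', []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                for value_id in rel['Ids']:
                    value_parts.append(_text_of(by_id[value_id], by_id))
        # Note: keys with identical text overwrite each other in this sketch
        pairs[_text_of(block, by_id)] = ' '.join(value_parts)
    return pairs
```

With the client and response from the example above, and once JobStatus is SUCCEEDED, this would be called as extract_key_values(get_all_blocks(client, response['JobId'])).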


Answer 1: score 3, accepted, posted 2020-03-03 14:15:14
Answer 2: score 0, posted 2021-11-04 22:42:00
