Unsupported document format while using Amazon Textract

Question

I am using Amazon Textract with boto3. When I try to parse a PDF file stored in Amazon S3, I get the error "Request has unsupported document format". I am fairly new to this, but the Textract documentation says that PDF files are supported.

This is the code I am using:

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']

This gives me the error: Request has unsupported document format.

Answer 1

detect_document_text() is a synchronous API that only supports PNG or JPEG images.

If you'd like to process PDF files, you should use the asynchronous API called start_document_text_detection().

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
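
As a rough illustration of the asynchronous route suggested above, the sketch below starts a job with start_document_text_detection() and then polls get_document_text_detection() until the job finishes, following NextToken to collect every page of results. The helper name detect_pdf_text and its poll_seconds and max_polls parameters are made up for this example, and simple polling is an assumption; it is a minimal sketch, not the only way to wait for a job.

```python
import time


def detect_pdf_text(client, bucket, name, poll_seconds=5, max_polls=60):
    """Run asynchronous Textract text detection on an S3 document.

    `client` is a boto3 Textract client. The helper name and the polling
    parameters are hypothetical, chosen for this sketch only.
    """
    # Start the asynchronous job; Textract returns a JobId right away.
    job = client.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': name}})
    job_id = job['JobId']

    # Poll until the job is no longer IN_PROGRESS, or give up after max_polls.
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        if response['JobStatus'] != 'IN_PROGRESS':
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError('Textract job %s did not finish in time' % job_id)

    if response['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError('Textract job %s ended with status %s'
                           % (job_id, response['JobStatus']))

    # Results can be paginated; follow NextToken until it is absent.
    blocks = list(response.get('Blocks', []))
    while 'NextToken' in response:
        response = client.get_document_text_detection(
            JobId=job_id, NextToken=response['NextToken'])
        blocks.extend(response.get('Blocks', []))
    return blocks


# Usage (needs AWS credentials; names are the question's placeholders):
# import boto3
# textractClient = boto3.client('textract', region_name='us-east-1')
# blocks = detect_pdf_text(textractClient, 'bucketName', 'filename.pdf')
```

The returned list plays the same role as response['Blocks'] in the question's code, so the rest of the parsing logic should not need to change.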

1 answers

solution1
17 2019-07-19 00:02:13
