
"Unsupported document format" error while using Amazon Textract


I am using Amazon Textract with boto3. When I try to parse a PDF file stored in Amazon S3, I get the error "Request has unsupported document format". I am fairly new to this, and the Textract documentation says that PDF files are supported.

This is the code I am using:

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']

This gives me the error "Request has unsupported document format".

detect_document_text() is a synchronous API that only supports PNG or JPEG images.

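If you'd like to process PDF files, you should use the asynchronous API start_document_text_detection(). It returns a JobId instead of the blocks themselves; you then fetch the results with get_document_text_detection() once the job has finished.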

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
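
To make the asynchronous flow concrete, here is a minimal sketch of how it could look for the PDF in the question. The helper names (start_pdf_text_detection, wait_for_blocks, main) and the polling parameters are my own and are not part of boto3; the bucket and file names are the placeholders from the question. It polls for completion to keep the example short.

```python
import time


def start_pdf_text_detection(client, bucket, key):
    """Start an asynchronous text-detection job for a PDF stored in S3."""
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def wait_for_blocks(client, job_id, poll_seconds=5, max_polls=60):
    """Poll until the job leaves IN_PROGRESS, then collect every page of Blocks.

    poll_seconds and max_polls are illustrative defaults, not AWS recommendations.
    """
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results are paginated: keep following NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
    return blocks


def main():
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    client = boto3.client("textract", region_name="us-east-1")
    job_id = start_pdf_text_detection(client, "bucketName", "filename.pdf")
    for block in wait_for_blocks(client, job_id):
        if block["BlockType"] == "LINE":
            print(block["Text"])
```

Calling main() with valid AWS credentials should print the detected lines of text. Polling is fine for a quick script; for anything long-running, start_document_text_detection() also accepts a NotificationChannel (an SNS topic plus an IAM role) so that you are notified when the job completes instead of polling.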

