I'm using boto3 (the AWS SDK for Python) to analyze a document (a PDF) and extract its form key-value pairs.
import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()

    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=["FORMS"])

process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')
I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:
botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format
Am I missing something?
AnalyzeDocument is a synchronous API that only supports PNG or JPG images.
Since you want to work with PDF files, you'll need to use the Amazon Textract asynchronous API, e.g. StartDocumentAnalysis or StartDocumentTextDetection.
As the docs say:
StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.
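import time

import boto3

client = boto3.client('textract')

# Start the asynchronous analysis job; the document must already be in S3
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

# Get results from the asynchronous operation.
# The job is not finished right away, so poll until it leaves IN_PROGRESS.
result = client.get_document_analysis(JobId=response['JobId'])
while result['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)
    result = client.get_document_analysis(JobId=response['JobId'])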
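Since the goal in the question is the form key-value pairs, here is a rough sketch of how the job's output could be turned into a plain dict. It assumes the usual Textract block layout: KEY_VALUE_SET blocks whose EntityTypes is KEY or VALUE, linked through VALUE and CHILD relationships to WORD blocks. The helper names fetch_all_blocks, _text_of and extract_key_values are made up for this example and are not part of boto3. Results may span several pages, so the sketch also follows NextToken.

```python
import time


def fetch_all_blocks(client, job_id, poll_seconds=5):
    """Wait for an analysis job to finish, then collect blocks from every page."""
    result = client.get_document_analysis(JobId=job_id)
    while result['JobStatus'] == 'IN_PROGRESS':
        time.sleep(poll_seconds)
        result = client.get_document_analysis(JobId=job_id)
    # Assumption: anything other than SUCCEEDED is treated as a failure here
    if result['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError('Textract job ended with status ' + result['JobStatus'])

    blocks = list(result.get('Blocks', []))
    # Larger documents come back in pages linked by NextToken
    while 'NextToken' in result:
        result = client.get_document_analysis(
            JobId=job_id, NextToken=result['NextToken'])
        blocks.extend(result.get('Blocks', []))
    return blocks


def _text_of(block, by_id):
    """Join the text of a block's WORD / SELECTION_ELEMENT children."""
    parts = []
    for rel in block.get('Relationships', []):
        if rel['Type'] != 'CHILD':
            continue
        for child_id in rel['Ids']:
            child = by_id[child_id]
            if child['BlockType'] == 'WORD':
                parts.append(child['Text'])
            elif child['BlockType'] == 'SELECTION_ELEMENT':
                parts.append(child['SelectionStatus'])
    return ' '.join(parts)


def extract_key_values(blocks):
    """Map each form key's text to the text of its linked value block(s)."""
    by_id = {block['Id']: block for block in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block['BlockType'] == 'KEY_VALUE_SET'
                  and 'KEY' in block.get('EntityTypes', []))
        if not is_key:
            continue
        values = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                values.extend(_text_of(by_id[i], by_id) for i in rel['Ids'])
        pairs[_text_of(block, by_id)] = ' '.join(values)
    return pairs
```

With the client and response from the snippet above, this would be called as extract_key_values(fetch_all_blocks(client, response['JobId'])). If the same key text appears more than once in the form, later occurrences overwrite earlier ones in this sketch.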