Having difficulties using the QUERIES option in Textract start_document_analysis with boto3 Python
My problem is with the Textract asynchronous method start_document_analysis. It lets you choose which types of analysis to run, but when I try to use the "QUERIES" feature =>
FeatureTypes=[
    'TABLES'|'FORMS'|'QUERIES',
],
you have to pass another parameter containing the list of queries =>
QueriesConfig={
    'Queries': [
        {
            'Text': 'string',
            'Alias': 'string',
            'Pages': [
                'string',
            ]
        },
    ]
}
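As soon as I pass this parameter, boto3 throws an exception saying that QueriesConfig is not recognized as one of the accepted parameters. Has anyone used this feature from Python before?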
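One thing worth checking first: an "unknown parameter" error of this kind usually comes from boto3's client-side validation, which relies on the service model bundled with the installed botocore. If the installed version predates Queries support, upgrading boto3 is the likely fix. That is an assumption about the cause, not something confirmed in this thread. Below is a minimal sketch to check what the installed model accepts. The helper names operation_accepts and check_textract_queries_support are made up for illustration, and region_name is a placeholder.

```python
def operation_accepts(client, operation_name, parameter_name):
    """Return True if the client's bundled service model lists the parameter."""
    operation = client.meta.service_model.operation_model(operation_name)
    input_shape = operation.input_shape
    return input_shape is not None and parameter_name in input_shape.members


def check_textract_queries_support():
    # Imported lazily so operation_accepts can be used without boto3 installed.
    import boto3

    # region_name is a placeholder; building a client makes no network call.
    client = boto3.client("textract", region_name="us-east-1")
    print("boto3 version:", boto3.__version__)
    print(
        "QueriesConfig accepted:",
        operation_accepts(client, "StartDocumentAnalysis", "QueriesConfig"),
    )
```

If check_textract_queries_support() prints False for QueriesConfig, the local boto3/botocore install does not know the parameter yet, and "pip install --upgrade boto3" would be the next thing to try.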
You can use it this way:
import time

import boto3


def getJobResults(jobId):
    # Collect every page of results, following NextToken until it is absent.
    pages = []
    client = boto3.client('textract')
    response = client.get_document_analysis(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_analysis(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages


def get_kv_map(s3BucketName, documentName):
    client = boto3.client('textract')
    # Start the asynchronous job with the QUERIES feature enabled.
    response = client.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        },
        FeatureTypes=['QUERIES'],
        QueriesConfig={
            'Queries': [
                {
                    "Text": "is 1. A. checkbox selected"
                }
            ]
        }
    )
    job_id = response['JobId']
    # Poll until the job leaves the IN_PROGRESS state.
    response = client.get_document_analysis(JobId=job_id)
    status = response["JobStatus"]
    while status == "IN_PROGRESS":
        time.sleep(3)
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    # Fetch all result pages once the job has finished.
    response = getJobResults(job_id)
    return response


def query_extraction():
    s3BucketName = "bucket-name"
    documentName = "xyz.pdf"
    data = get_kv_map(s3BucketName, documentName)
    return data


data = query_extraction()
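The code above returns the raw result pages, so you still need to pull the answers out of them. In the Textract response, each query comes back as a block with BlockType QUERY, and its answers are QUERY_RESULT blocks linked through a relationship of type ANSWER. A minimal sketch of that lookup follows, assuming the list-of-pages shape returned by getJobResults; the function name extract_query_answers is made up for illustration.

```python
def extract_query_answers(pages):
    """Pair each QUERY block with the text of its QUERY_RESULT blocks."""
    blocks = [block for page in pages for block in page.get("Blocks", [])]
    by_id = {block["Id"]: block for block in blocks}

    answers = []
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        # ANSWER relationships point at the QUERY_RESULT blocks for this query.
        result_ids = [
            block_id
            for rel in block.get("Relationships", [])
            if rel.get("Type") == "ANSWER"
            for block_id in rel.get("Ids", [])
        ]
        results = [by_id[i] for i in result_ids if i in by_id]
        answers.append({
            "question": query.get("Text"),
            "alias": query.get("Alias"),
            "answers": [(r.get("Text"), r.get("Confidence")) for r in results],
        })
    return answers
```

Calling extract_query_answers(data) on the pages returned above gives one entry per query. A query with an empty "answers" list means Textract found nothing for it.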
Hope this solves your problem.
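A small follow-up on the QueriesConfig structure itself: each query takes a required Text plus an optional Alias and Pages. As far as I know from the API reference, a query without Pages is applied to page 1 only, so for a multi-page PDF you may want to pass something like ["*"] or ["1-3"]. Here is a rough sketch of a builder for that dictionary. The name build_queries_config and the alias-to-question mapping are my own conventions, not part of boto3.

```python
def build_queries_config(questions, pages=None):
    """Build a QueriesConfig dict from an {alias: question text} mapping.

    pages is an optional list of page selectors such as ["*"] or ["1-3"];
    when omitted, no Pages key is sent and the service default applies.
    """
    queries = []
    for alias, text in questions.items():
        query = {"Text": text, "Alias": alias}
        if pages is not None:
            # Copy so later changes to the caller's list do not leak in.
            query["Pages"] = list(pages)
        queries.append(query)
    return {"Queries": queries}
```

The result can be passed straight through, for example QueriesConfig=build_queries_config({"BOX_1A": "is 1. A. checkbox selected"}, pages=["*"]), and the alias then shows up on the matching QUERY block in the response.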