
AWS Textract extracts only the first page of a multipage PDF for Form and Table extraction

I am using AWS Textract for Form and Table extraction with the code below. For some PDFs it extracts forms from all the pages, but for others it extracts only the first page. When I use the Textract user interface, it extracts all the pages. What could be the reason for this?

I am using the following code, which is available on AWS.

import time

import boto3

def create_client(access_key, secret_key):
    return boto3.client('textract', region_name='us-east-2',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key)

def isJobComplete(jobId):
    client = create_client(access_key, secret_key)
    response = client.get_document_analysis(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(2)
        response = client.get_document_analysis(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status
    
def getJobResults(jobId):
    client = create_client(access_key, secret_key)
    response = client.get_document_analysis(JobId=jobId)
    return response

Edit: It looks like it's related to the response size. The size is almost the same every time.

Can anyone help me with this?

Found the solution...

There is a parameter called NextToken. From the current response, take the NextToken value and pass it as a parameter to get_document_analysis, repeating until the response no longer contains a NextToken. Each call returns one batch of the results.
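For anyone landing here, a minimal sketch of that loop is below. The helper names get_all_analysis_pages and collect_blocks are my own, not part of boto3, and the sketch assumes the job has already reached SUCCEEDED (for example after isJobComplete from the question returns). The client is passed in as an argument, so it can be the one returned by create_client.

```python
def get_all_analysis_pages(client, job_id):
    """Fetch every batch of results for a finished Textract analysis job.

    `client` is assumed to be a boto3 Textract client, e.g. the one
    returned by create_client(access_key, secret_key) in the question.
    """
    pages = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        # Only send NextToken once a previous response has supplied one.
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        pages.append(response)
        # The last batch carries no NextToken, which ends the loop.
        next_token = response.get("NextToken")
        if not next_token:
            break
    return pages


def collect_blocks(pages):
    """Flatten the Blocks lists of all batches into a single list."""
    blocks = []
    for page in pages:
        blocks.extend(page.get("Blocks", []))
    return blocks
```

With this, getJobResults could return get_all_analysis_pages(client, jobId) instead of a single response, and the form and table parsing would then run over collect_blocks(...) rather than over the first batch only.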
