AWS Textract extracts only the first page of a multipage PDF for Form and Table extraction

Question

I am using AWS Textract for Form and Table extraction with the code below. For some PDFs it extracts forms from all pages, but for others it extracts only the first page. The Textract user interface extracts all pages from the same files. What could be the reason for this?

I am using the following code, which is available from AWS.

import time

import boto3

# access_key and secret_key are defined elsewhere in the script.

def create_client(access_key, secret_key):
    # Build a Textract client for the us-east-2 region
    return boto3.client('textract', region_name='us-east-2',
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key)

def isJobComplete(jobId):
    # Poll the job every 2 seconds until it leaves the IN_PROGRESS state
    client = create_client(access_key, secret_key)
    response = client.get_document_analysis(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while status == "IN_PROGRESS":
        time.sleep(2)
        response = client.get_document_analysis(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def getJobResults(jobId):
    # Fetch the analysis results with a single call
    client = create_client(access_key, secret_key)
    response = client.get_document_analysis(JobId=jobId)
    return response

Edit: It looks like it's related to the response size. The size of the response is almost the same every time.

Can anyone help me with this?

Answer 1

Found the solution...

There is a parameter called NextToken. From the current response, take the NextToken value and pass it as a parameter to get_document_analysis, repeating until no NextToken is returned. Each call gives you one batch of the results, and together the batches cover all pages.
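Here is a minimal sketch of that loop, not a definitive implementation. The helper names get_all_results and collect_blocks are my own, and I am assuming client is a boto3 Textract client (for example, the one returned by create_client in the question) and that the job has already finished.

```python
def get_all_results(client, job_id):
    """Call get_document_analysis repeatedly until no NextToken is returned.

    Returns the list of raw responses, one per batch.
    """
    pages = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Ask for the batch that follows the previous response
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        pages.append(response)
        # The key is missing from the last batch, so .get() yields None
        next_token = response.get("NextToken")
        if not next_token:
            break
    return pages


def collect_blocks(pages):
    """Flatten the Blocks lists of all batches into a single list."""
    return [block for page in pages for block in page.get("Blocks", [])]
```

With the question's code, getJobResults could return collect_blocks(get_all_results(client, jobId)) instead of the single response, so forms and tables from later pages are included.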

solution1
0 2021-12-16 09:35:35
