How do I add boto3 output to a new file and upload it back to an AWS S3 bucket with a single script?

Question

The purpose of the code below is to read a PDF file located in an S3 bucket and print the values of the PDF in the terminal. My end goal is to load those values into a csv/xlsx file and upload it to the same S3 bucket. In other words, this is a file conversion from PDF to xlsx.

Adding item.to_excel at the end does not load the data into Excel. Any suggestions? The code below only creates an empty xlsx file in the local directory, but I need it to do the following:

Save the data printed in the terminal (read from the PDF located in S3) to an xlsx file
Take that xlsx file containing the terminal data and upload it back to S3

import boto3
import time
import pandas as pd

# Textract APIs used - "start_document_text_detection", "get_document_text_detection"
def InvokeTextDetectJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]

def CheckJobComplete(jobId):
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def JobResults(jobId):
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
 
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
        while(nextToken):
            response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
            pages.append(response)
            print("Resultset page received: {}".format(len(pages)))
            nextToken = None
            if('NextToken' in response):
                nextToken = response['NextToken']
    return pages

# S3 Document Data
s3BucketName = "pdfbucket"
documentName = "pdf"

# Function invokes
jobId = InvokeTextDetectJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(CheckJobComplete(jobId)):
    response = JobResults(jobId)
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print (item["Text"])

with pd.ExcelWriter('output_cp.xlsx') as writer:
    item.to_excel(writer, sheetName='Sheet1')

Answer 1

After some more research and checking with other people, below is the final product.

The purpose of this code is the following:

Read a PDF file stored in an AWS S3 bucket
After reading the file, save the data as text in a new xlsx file

import boto3
import time
import pandas as pd
from xlsxwriter import Workbook

# Textract APIs used - "start_document_text_detection", "get_document_text_detection"


def InvokeTextDetectJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]


def CheckJobComplete(jobId):
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status


def JobResults(jobId):
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)

    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
        while(nextToken):
            response = client.get_document_text_detection(
                JobId=jobId, NextToken=nextToken)
            pages.append(response)
            print("Resultset page received: {}".format(len(pages)))
            nextToken = None
            if('NextToken' in response):
                nextToken = response['NextToken']
    return pages


# S3 Document Data
s3BucketName = "cpaypdf"
documentName = "cp.pdf"

# Function invokes
jobId = InvokeTextDetectJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
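status = CheckJobComplete(jobId)
# Only read results when the job succeeded; a FAILED status is also a non-empty string
if status == "SUCCEEDED":
    response = JobResults(jobId)
    # Collect the text of every LINE block across all result pages
    lines = []
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                lines.append(item["Text"])
    # Build the DataFrame once; DataFrame.append was removed in pandas 2.0
    df = pd.DataFrame({"Text": lines})
    # The context manager saves and closes the workbook (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter("result.xlsx", engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='sheetname', index=False)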
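
The script above stops once result.xlsx has been written locally, while the question also asks for the file to go back to S3. The following is a minimal sketch of that last step, not part of the original answer. It assumes the boto3 S3 client method upload_file(Filename, Bucket, Key); the helper name upload_result, its s3_client parameter, and the object key in the example call are hypothetical names introduced for illustration.

```python
def upload_result(local_path, bucket, key, s3_client=None):
    """Upload a local file to S3 and return its s3:// URI.

    upload_result and s3_client are illustrative names, not part of the
    original answer. Passing a client in makes the helper easy to test.
    """
    if s3_client is None:
        # Imported lazily so the helper can be exercised without AWS access
        import boto3
        s3_client = boto3.client("s3")
    # upload_file takes the local filename, the bucket name, and the object key
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


# Example call (assumed key name; needs AWS credentials with s3:PutObject):
# upload_result("result.xlsx", "cpaypdf", "result.xlsx")
```

Calling this right after the ExcelWriter block, with the same bucket name used for the PDF, would place the spreadsheet next to the source document. Uploading from disk keeps the sketch close to the answer's code; writing the workbook to an in-memory buffer and uploading that would avoid the temporary local file.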

Answer 1: 0 votes, accepted, 2021-09-15 22:21:37