
How to add boto3 output into a new file and upload it back to an AWS s3 bucket using one script?

The purpose of the code below is to read a pdf file located in an s3 bucket and list the values of the pdf in the terminal. My end goal is to load those values into a csv/xlsx file and upload it back to the same s3 bucket. In other words, this is a file conversion from pdf to xlsx.

Adding item.to_excel at the end is not loading the data into Excel, any suggestions? The code below only creates an empty xlsx file in the local directory, but I need it to do the following:

  1. save the data listed in the terminal (from reading the pdf located in s3) into an xlsx file
  2. take that xlsx file that holds the terminal data and upload it back to s3
import boto3
import time
import pandas as pd

# Textract APIs used - "start_document_text_detection", "get_document_text_detection"
def InvokeTextDetectJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]

def CheckJobComplete(jobId):
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def JobResults(jobId):
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
 
    pages.append(response)
    print("Resultset page recieved: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
        while(nextToken):
            response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
            pages.append(response)
            print("Resultset page recieved: {}".format(len(pages)))
            nextToken = None
            if('NextToken' in response):
                nextToken = response['NextToken']
    return pages

# S3 Document Data
s3BucketName = "pdfbucket"
documentName = "pdf"

# Function invokes
jobId = InvokeTextDetectJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(CheckJobComplete(jobId)):
    response = JobResults(jobId)
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print (item["Text"])

with pd.ExcelWriter('output_cp.xlsx') as writer:
    item.to_excel(writer, sheetName='Sheet1')
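
The empty workbook comes from two separate problems in the last block: when the with pd.ExcelWriter block runs, item is only the last Block dict left over from the loop, and a plain dict has no to_excel method; on top of that, the keyword to_excel expects is sheet_name, not sheetName. A minimal sketch of the fix, reusing the response list returned by JobResults above (the output filename is just an example, and pandas needs openpyxl or xlsxwriter installed to write xlsx):

# Gather the text of every LINE block, then write a single DataFrame
lines = [item["Text"]
         for resultPage in response
         for item in resultPage["Blocks"]
         if item["BlockType"] == "LINE"]

df = pd.DataFrame({"Text": lines})
df.to_excel("output_cp.xlsx", sheet_name="Sheet1", index=False)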

So after some more research and checking with other people, below is the final product.

The purpose of this code is the following:

  1. Read a pdf file stored in an AWS S3 Bucket
  2. After reading the file, save the data as text in a new xlsx file
import boto3
import time
import pandas as pd
from xlsxwriter import Workbook

# Textract APIs used - "start_document_text_detection", "get_document_text_detection"


def InvokeTextDetectJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]


def CheckJobComplete(jobId):
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status


def JobResults(jobId):
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)

    pages.append(response)
    print("Resultset page recieved: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
        while(nextToken):
            response = client.get_document_text_detection(
                JobId=jobId, NextToken=nextToken)
            pages.append(response)
            print("Resultset page recieved: {}".format(len(pages)))
            nextToken = None
            if('NextToken' in response):
                nextToken = response['NextToken']
    return pages


# S3 Document Data
s3BucketName = "cpaypdf"
documentName = "cp.pdf"

# Function invokes
jobId = InvokeTextDetectJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(CheckJobComplete(jobId)):
    response = JobResults(jobId)
    # Collect every LINE block first; DataFrame.append and ExcelWriter.save
    # were removed in pandas 2.0, so build the frame once and let the
    # context manager save the workbook.
    rows = []
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                rows.append({"Text": item["Text"]})
    df = pd.DataFrame(rows, columns=["Text"])
    with pd.ExcelWriter("result.xlsx", engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='sheetname', index=False)
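
The upload half of the original goal is not shown above. A minimal sketch of that last step, appended inside the same if block (boto3 is already imported at the top of the script; the bucket and key names simply reuse the ones from this example):

    # Push the workbook written above back into the same bucket
    s3 = boto3.client('s3')
    s3.upload_file("result.xlsx", "cpaypdf", "result.xlsx")

One more detail worth double-checking: CheckJobComplete returns the raw status string, and any non-empty string is truthy in Python, so the if test above also passes when the job ends as FAILED; comparing the result against "SUCCEEDED" explicitly is safer.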
