How to add boto3 output into a new file and upload it back to an AWS s3 bucket using one script?
The purpose of the code below is to read a pdf file located in an s3 bucket and list the values of the pdf in the terminal. My end goal is to load those values into a csv/xlsx and upload it to the same s3 bucket. In other words, this is a file conversion from pdf to xlsx.
Adding item.to_excel at the end is not loading the data into Excel. Any suggestions? The code below only creates an empty xlsx file in the local directory, but I need it to do the following:
1. Read the pdf in s3 and save the data listed in the terminal in an xlsx
2. Create an xlsx file that has the terminal data and upload it back to s3

import boto3
import time
import pandas as pd

# Textract APIs used - "start_document_text_detection", "get_document_text_detection"


def InvokeTextDetectJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]


def CheckJobComplete(jobId):
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status


def JobResults(jobId):
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
    while(nextToken):
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']
    return pages


# S3 Document Data
s3BucketName = "pdfbucket"
documentName = "pdf"

# Function invokes
jobId = InvokeTextDetectJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(CheckJobComplete(jobId)):
    response = JobResults(jobId)
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print (item["Text"])

with pd.ExcelWriter('output_cp.xlsx') as writer:
    item.to_excel(writer, sheetName='Sheet1')

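A likely reason the workbook comes out empty: each item in resultPage["Blocks"] is a plain Python dict, which has no to_excel method (that method belongs to pandas DataFrame and Series objects, and its keyword is sheet_name, not sheetName). One way around this is to collect the LINE text first and write it out in a single step afterwards. The sketch below is only an illustration of that idea, using the standard library and the csv target mentioned in the question. The names collect_line_text, lines_to_csv, and sample_pages are hypothetical, and the sample data is made up to mimic the shape of a get_document_text_detection response.

```python
import csv
import io


def collect_line_text(pages):
    """Return the text of every LINE block across all Textract result pages."""
    lines = []
    for page in pages:
        for block in page.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                lines.append(block["Text"])
    return lines


def lines_to_csv(lines):
    """Serialize the lines as a one-column CSV string with a "Text" header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Text"])
    for line in lines:
        writer.writerow([line])
    return buffer.getvalue()


# Made-up sample (an assumption) shaped like Textract result pages.
sample_pages = [
    {"Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 001"},
        {"BlockType": "WORD", "Text": "Invoice"},
    ]},
    {"Blocks": [{"BlockType": "LINE", "Text": "Total: 42"}]},
]

lines = collect_line_text(sample_pages)
csv_text = lines_to_csv(lines)
print(csv_text)
```

In the real script, the list returned by JobResults(jobId) would take the place of sample_pages. Only LINE blocks are kept because WORD blocks repeat the same text word by word.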
So after some more research and checking with other people, below is the final product.
The purpose of this code is the following:
1. Read a pdf file stored in an AWS S3 Bucket
2. Save the extracted text in an xlsx file

import boto3
import time
import pandas as pd
# The 'xlsxwriter' engine used below requires the xlsxwriter package to be installed.

# Textract APIs used - "start_document_text_detection", "get_document_text_detection"


def InvokeTextDetectJob(s3BucketName, objectName):
    # Start an asynchronous text-detection job for the S3 object.
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]


def CheckJobComplete(jobId):
    # Poll every 5 seconds until the job leaves the IN_PROGRESS state.
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status


def JobResults(jobId):
    # Collect every page of results by following NextToken.
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
    while(nextToken):
        response = client.get_document_text_detection(
            JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']
    return pages


# S3 Document Data
s3BucketName = "cpaypdf"
documentName = "cp.pdf"

# Function invokes
jobId = InvokeTextDetectJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
# CheckJobComplete returns the status string, which is always truthy,
# so compare it against "SUCCEEDED" explicitly.
if CheckJobComplete(jobId) == "SUCCEEDED":
    response = JobResults(jobId)
    # Collect the text of every LINE block (DataFrame.append was removed in pandas 2.0).
    rows = []
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                rows.append({"Text": item['Text']})
    df = pd.DataFrame(rows, columns=["Text"])
    # The context manager saves the file on exit (ExcelWriter.save() was removed in pandas 2.0).
    with pd.ExcelWriter("result.xlsx", engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='sheetname', index=False)
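
The script above stops once result.xlsx is written locally, while the question also asks to upload the file back to the bucket. A minimal sketch of that last step follows, assuming the credentials in use are allowed to write objects to the bucket. The helper names upload_workbook and main and the object key "output/result.xlsx" are hypothetical; upload_file is the standard boto3 S3 client method that takes a local filename, a bucket, and a key.

```python
def upload_workbook(s3_client, local_path, bucket, key):
    """Upload local_path to s3://bucket/key and return the destination URI."""
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


def main():
    # boto3 is imported here so the helper above can be exercised
    # with a stand-in client when no AWS access is available.
    import boto3

    s3 = boto3.client("s3")
    # "result.xlsx" and "cpaypdf" come from the script above; the key
    # "output/result.xlsx" is a hypothetical choice.
    print(upload_workbook(s3, "result.xlsx", "cpaypdf", "output/result.xlsx"))
```

Calling main() after the workbook has been written would complete the pdf to xlsx round trip in one script. Passing the client in as an argument is a design choice that keeps the upload step easy to test with a fake client.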
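
Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.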