
Selecting multiple PDFs based on keywords and uploading them to an S3 Bucket using Python boto3

Problem description: I have PDF files in an S3 bucket called "cases". I need to loop through all these PDFs, read each one, and select PDFs based on keywords. The PDFs that contain the specified keywords need to be stored in the "confirmed-covid19" bucket. Those PDFs without the specified keywords will be stored in the "no-covid" bucket.

Error: "ValueError: Filename must be a string."

Narrative: I ran the code in chunks to identify errors. The code shown below works above Line 37; the error is related to the code written below Line 37. My understanding is that the 'upload_file' function only accepts strings for the Filename and Key parameters. How can I fix this issue and put the selected PDFs containing keywords in the "confirmed-covid19" bucket, and the rest in the "no-covid" bucket? I still want to keep the original name of each PDF file. What is the most efficient way to achieve this task? Also, all suggestions to improve the code are welcome.

import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO

# Call boto3 to access AWS S3:
s3 = boto3.resource(
     service_name='s3',
     region_name='us-east-1',
     aws_access_key_id='MY_ACCESS_KEY_ID',
     aws_secret_access_key='MY_SECRET_ACCESS_KEY'
)

# Define S3 Bucket name:
bucket_name = s3.Bucket("cases")

# define keywords
search_words = ['Covid-19','Corona','virus'] # Look for these words in PDFs.

# Clients provide a low-level interface to AWS
s3_client = boto3.client('s3')

for filename in bucket_name.objects.all(): # ObjectSummary iterator, not strings.
    body = filename.get()['Body'].read()   # Raw bytes of the PDF.
    f = PdfFileReader(BytesIO(body))       # Read the content of each file.
    
    # Search for keywords
    for i in range(f.numPages):
        page = f.getPage(i)          # get pages from pdf files
        text = page.extractText()    # extract the text from each page
        search_text = text.lower().split()
        
# ------------------------------ Line 37 -------------------------------- #   

        for word in search_words:         # look at each keyword 
            if word in search_text:       # find the keyword(s) in the text
                s3_client.upload_file(filename, 'confirmed-covid19', filename)
            else:
                s3_client.upload_file(filename, 'no-covid', filename)

You should try the pdfminer module. It extracts the text from the PDF and writes a txt file.
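For example, with the pdfminer.six fork of that module (a sketch, assuming `pip install pdfminer.six`; rather than writing a txt file, `extract_text` here returns the text directly, and the helper names are mine):

```python
def find_keywords(text, keywords):
    """Return the keywords that appear in the text, case-insensitively."""
    lowered = text.lower()
    return [w for w in keywords if w.lower() in lowered]

def pdf_keywords(pdf_path, keywords):
    """Extract all text from a local PDF and scan it for keywords."""
    from pdfminer.high_level import extract_text  # lazy import: needs pdfminer.six
    return find_keywords(extract_text(pdf_path), keywords)
```

Scanning the whole document's text once per file also avoids the original code's per-page, per-word upload calls.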

@Michael

Read the boto3 docs:

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html

`upload_file` uploads a local file to S3, and you don't have a local file. You can either create the file locally and upload it, or copy the object from one bucket to another.

This is solved here:

how to copy s3 object from one bucket to another using python boto3
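Putting the linked answer together with the original loop, a minimal sketch (untested against a live account; the bucket names and keywords come from the question, the helper names are mine):

```python
from io import BytesIO

SEARCH_WORDS = ['Covid-19', 'Corona', 'virus']

def choose_bucket(text, keywords, hit_bucket, miss_bucket):
    """Pick a destination bucket from one case-insensitive keyword scan."""
    lowered = text.lower()
    return hit_bucket if any(w.lower() in lowered for w in keywords) else miss_bucket

def sort_pdfs(source='cases', hit='confirmed-covid19', miss='no-covid'):
    """Copy every PDF in `source` to `hit` or `miss`, keeping its object key."""
    import boto3                         # lazy imports so choose_bucket stays
    from PyPDF2 import PdfFileReader     # importable without these packages
    s3 = boto3.resource('s3')
    for obj in s3.Bucket(source).objects.all():   # ObjectSummary, not a filename string
        reader = PdfFileReader(BytesIO(obj.get()['Body'].read()))
        text = ' '.join(reader.getPage(i).extractText()
                        for i in range(reader.numPages))
        dest = choose_bucket(text, SEARCH_WORDS, hit, miss)
        # Server-side copy keeps the original object name and needs no local
        # file, which sidesteps the ValueError that upload_file raised.
        s3.Bucket(dest).copy({'Bucket': source, 'Key': obj.key}, obj.key)
```

This also decides once per document instead of uploading once per page per keyword, as the original nested loop did.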

Best Regards.
