
Selecting multiple PDFs based on keywords and uploading them to an S3 bucket using Python boto3

Problem description: I have PDF files in an S3 bucket called "cases". I need to loop through all these PDFs, read each one, and select PDFs based on keywords. The PDFs that contain the specified keywords need to be stored in the "confirmed-covid19" bucket. The PDFs without the specified keywords should be stored in the "no-covid" bucket.

Error: "ValueError: Filename must be a string."

Narrative: I ran the code in chunks to identify errors. The code shown below works up to the "Line 37" marker; the error comes from the code below that marker. My understanding is that the function 'upload_file' only takes strings for the Filename and Key parameters. How can I fix this issue and put the PDFs containing the keywords in the "confirmed-covid19" bucket and the rest in the "no-covid" bucket? I still want to keep the original name of each PDF file. What is the most efficient way to achieve this task? All suggestions to improve the code are also welcome.

import PyPDF2
import re
import os
import textract
import boto3
import glob
from PyPDF2 import PdfFileReader
from io import BytesIO

# Call boto3 to access AWS S3:
s3 = boto3.resource(
     service_name='s3',
     region_name='us-east-1',
     aws_access_key_id='MY_ACCESS_KEY_ID',
     aws_secret_access_key='MY_SECRET_ACCESS_KEY'
)

# Define S3 Bucket name:
bucket_name = s3.Bucket("cases")

# define keywords
search_words = ['Covid-19','Corona','virus'] # Look for these words in PDFs.

# Clients provide a low-level interface to AWS
s3_client = boto3.client('s3')

for filename in bucket_name.objects.all(): # Object summary iterator.
    body = filename.get()['Body'].read()
    f = PdfFileReader(BytesIO(body))       # Read the content of each file
    
    # Search for keywords
    for i in range(f.numPages):
        page = f.getPage(i)          # get pages from pdf files
        text = page.extractText()    # extract the text from each page
        search_text = text.lower().split()
        
# ------------------------------ Line 37 -------------------------------- #   

        for word in search_words:         # look at each keyword 
            if word in search_text:       # find the keyword(s) in the text
                s3_client.upload_file(filename, 'confirmed-covid19', filename)
            else:
                s3_client.upload_file(filename, 'no-covid', filename)

You should try the pdfminer module. It extracts the text from the PDF and writes a txt file.
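
Whichever library extracts the text, the keyword check can be made once per document instead of once per page and keyword. The following is only a sketch, assuming the page texts have already been extracted into a list of strings. 'choose_bucket' is a hypothetical helper, and the bucket names are the ones from the question. It lowercases both the text and the keywords, because the question's code lowercases only the text, so a capitalized keyword such as 'Covid-19' could never match there.

```python
def choose_bucket(page_texts, keywords,
                  match_bucket="confirmed-covid19",
                  no_match_bucket="no-covid"):
    """Return the destination bucket name for one PDF.

    page_texts: iterable of strings, one per extracted page (assumed input).
    keywords:   iterable of search terms; matching is case-insensitive
                and substring-based.
    """
    lowered_keywords = [kw.lower() for kw in keywords]
    for text in page_texts:
        # Some extractors return None or '' for pages without text.
        lowered_text = (text or "").lower()
        if any(kw in lowered_text for kw in lowered_keywords):
            return match_bucket  # one hit anywhere in the document is enough
    return no_match_bucket


search_words = ["Covid-19", "Corona", "virus"]
print(choose_bucket(["Patient tested positive for COVID-19."], search_words))
print(choose_bucket(["Routine check-up, nothing to report."], search_words))
```

Substring matching is a deliberate simplification: it also finds "Coronavirus" for the keyword "Corona", but it may over-match in other texts, so word-boundary matching with 're' could be a better fit depending on the documents.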

@Michael

Read the boto docs:

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html

The 'upload_file' function uploads a local file to S3. You don't have a local file. You can either create the file locally and upload it, or copy the object from one bucket to another.
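
A minimal sketch of the copy approach, assuming the destination buckets already exist and the credentials allow reading from "cases" and writing to the target bucket. 'copy_pdf' is a hypothetical helper and "report.pdf" is a made-up key. In the question's loop, each item is an ObjectSummary rather than a string, which is why 'upload_file' raises the ValueError; the string to pass as the key would be 'filename.key'.

```python
def copy_pdf(s3_client, source_bucket, key, dest_bucket):
    """Copy one object between buckets, keeping its key (the file name).

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    The copy happens on the S3 side, so nothing is written to local disk.
    """
    copy_source = {"Bucket": source_bucket, "Key": key}
    s3_client.copy(copy_source, dest_bucket, key)


# Hypothetical usage (needs AWS credentials, so it is left commented out):
# import boto3
# s3_client = boto3.client("s3")
# copy_pdf(s3_client, "cases", "report.pdf", "confirmed-covid19")
```

Passing the client in as a parameter keeps the helper easy to try out with a stand-in object before pointing it at real buckets.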

This is solved here:

how to copy s3 object from one bucket to another using python boto3

Best Regards.
