Extract text from PDF files in an S3 bucket with Python

Question

I have files in multiple formats in my AWS S3 bucket, such as pdf, doc, rtf, odt, and png, and I need to extract text from them. I have managed to get the list of contents with their paths. Depending on the file type, I will use a different library to extract the text. Since there can be thousands of files, I need to extract the text directly from S3 instead of downloading them.

filespath=['https://abc.s3.ap-south-1.amazonaws.com/DocumentOnPATest', 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf', 'https://abc.s3.ap-south-1.amazonaws.com/receipt.png', 'https://abc.s3.ap-south-1.amazonaws.com/sample.rtf', 'https://abc.s3.ap-south-1.amazonaws.com/sample1.odt']

bucketname = 'abc'

I tried the following, but it gives me an error:

import pathlib
import PyPDF2

for path in filespath:
    ext = pathlib.Path(path).suffix
    if ext == '.pdf':
        pdf_file = PyPDF2.PdfFileReader(path)
        print(pdf_file.extractText())

But I am getting this error:

  File "F:\Projects\FileExtractor\fileextracts3.py", line 28, in <module>
    pdf_file=PyPDF2.PdfFileReader(path)

  File "C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1081, in __init__
    fileobj = open(stream, 'rb')

OSError: [Errno 22] Invalid argument: 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf

Please point me in the right direction. Thank you.

Answer 1

PyPDF2 does not support reading from S3 directly. You'll need to download the files locally first.

~~Or you can try using [AWS Lambda functions][1] to process files directly from S3 buckets.~~
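
A minimal sketch of the download-first approach. It assumes the URLs in `filespath` are virtual-hosted-style S3 URLs (bucket name as the first part of the host) and that AWS credentials are already configured for boto3. The helper names `split_s3_url` and `download_to_temp` are made up for this illustration.

```python
import os
import tempfile
from urllib.parse import unquote, urlparse


def split_s3_url(url):
    """Split a virtual-hosted-style S3 URL into (bucket, key).

    Assumes the form https://<bucket>.s3.<region>.amazonaws.com/<key>.
    """
    parsed = urlparse(url)
    bucket = parsed.netloc.split(".s3.", 1)[0]
    key = unquote(parsed.path.lstrip("/"))
    return bucket, key


def download_to_temp(url):
    """Download the object behind `url` to a temporary file and return its path."""
    import boto3  # imported here so the URL helper works without boto3 installed

    bucket, key = split_s3_url(url)
    suffix = os.path.splitext(key)[1]
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # boto3 reopens the file by name
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```

The returned path can then be passed to `PyPDF2.PdfFileReader(local_path)`. Remove the file with `os.remove` once the text has been extracted, so thousands of files do not pile up on disk.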

Answer 2

You could try the boto3 solution here, provided by Justin Leto. You would still need a way of reading/converting the file stream for each file type, but the PDF answer is there.

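import boto3

s3 = boto3.resource('s3')
# bucket_name and itemname (the object key) must be defined beforehand
obj = s3.Object(bucket_name, itemname)
# read() loads the whole object into memory as bytes
fs = obj.get()['Body'].read()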
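
Building on the snippet above, the bytes returned by `read()` can be wrapped in `io.BytesIO` so that PyPDF2 sees a seekable file-like object, without writing anything to disk. This is a sketch under assumptions: it uses the legacy PyPDF2 1.x names that match the traceback in the question (`PdfFileReader`, `getPage`, `extractText`); newer releases rename them to `PdfReader`, `pages`, and `extract_text`. `extract_pdf_text`, `extractor_for`, and `text_from_s3` are hypothetical helper names, and the other file types are left unmapped.

```python
import io
import pathlib


def extract_pdf_text(data):
    """Return the text of every page of a PDF given as raw bytes."""
    import PyPDF2  # legacy 1.x API, matching the traceback in the question

    reader = PyPDF2.PdfFileReader(io.BytesIO(data))
    pages = (reader.getPage(i) for i in range(reader.getNumPages()))
    return "\n".join(page.extractText() for page in pages)


# Map file extensions to extractor functions; only PDF is filled in here.
EXTRACTORS = {".pdf": extract_pdf_text}


def extractor_for(key):
    """Pick an extractor from the object key's extension, or None if unsupported."""
    return EXTRACTORS.get(pathlib.PurePosixPath(key).suffix.lower())


def text_from_s3(bucket_name, key):
    """Read an S3 object into memory and extract its text, if the type is supported."""
    import boto3  # assumes AWS credentials are already configured

    extractor = extractor_for(key)
    if extractor is None:
        return None
    body = boto3.resource("s3").Object(bucket_name, key).get()["Body"].read()
    return extractor(body)
```

Note that this still transfers each object over the network and holds it fully in memory; it only avoids the temporary file. Extractors for the other formats (doc, rtf, odt, png) would be added to `EXTRACTORS` in the same way.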
