
Extract text from a PDF file in an S3 bucket with Python

I have files in multiple formats in my AWS S3 bucket, such as pdf, doc, rtf, odt, and png, and I need to extract text from them. I have managed to get the list of contents with their paths. Depending on the file type, I will use different libraries to extract the text. Since there can be thousands of files, I need to extract text directly from S3 instead of downloading them.

filespath=['https://abc.s3.ap-south-1.amazonaws.com/DocumentOnPATest', 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf', 'https://abc.s3.ap-south-1.amazonaws.com/receipt.png', 'https://abc.s3.ap-south-1.amazonaws.com/sample.rtf', 'https://abc.s3.ap-south-1.amazonaws.com/sample1.odt']

bucketname = 'abc'

I tried the following, but it raises an error:

for path in filespath:
    ext=pathlib.Path(path).suffix
    if ext=='.pdf':
       pdf_file=PyPDF2.PdfFileReader(path)
       print(pdf_file.extractText())

This is the error I get:

  File "F:\Projects\FileExtractor\fileextracts3.py", line 28, in <module>
    pdf_file=PyPDF2.PdfFileReader(path)

  File "C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1081, in __init__
    fileobj = open(stream, 'rb')

OSError: [Errno 22] Invalid argument: 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf

Please point me in the right direction. Thank you.

PyPDF2 does not support reading from S3 directly. You'll need to download the files locally first.

Or you can try using [AWS Lambda functions][1] to process files directly from S3 buckets.
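To make the Lambda suggestion a little more concrete, here is a minimal sketch of a handler that pulls the bucket name and object key out of an S3 event notification. The function names `objects_from_event` and `lambda_handler` are illustrative assumptions, and the extraction step itself is left as a placeholder. Note that the function would still have to read each object's bytes from S3 before extracting text.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    """Entry point; the actual name is set in the function's handler config."""
    pairs = objects_from_event(event)
    for bucket, key in pairs:
        # Hypothetical placeholder: per-file-type text extraction goes here.
        print(f"would process s3://{bucket}/{key}")
    return {"processed": len(pairs)}
```

A handler like this would typically be wired to the bucket's object-created notifications, so each new upload is processed once instead of scanning thousands of files in a loop.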

You could try the boto3 solution here, provided by Justin Leto. You would still need a way of reading or converting the file stream for each file type, but the PDF answer is there.

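import boto3

s3 = boto3.resource('s3')
# bucket_name and itemname (the object key) are placeholders to fill in
obj = s3.Object(bucket_name, itemname)
# Read the whole object into memory as bytes; nothing is written to disk
fs = obj.get()['Body'].read()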
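The snippet above needs a bucket name and an object key, while the question only has full URLs, and the bytes it returns still have to reach a PDF reader. The following is a rough sketch of one way to connect those pieces, not a definitive implementation. It assumes virtual-hosted-style URLs like the ones in the question and PyPDF2 2.x or later (older releases expose `PdfFileReader`, `getPage`, and `extractText` instead). The helper names `split_s3_url`, `read_s3_bytes`, `pdf_text`, and `extract_text` are made up for illustration.

```python
import io
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def split_s3_url(url):
    """Split a virtual-hosted-style S3 URL into (bucket, key)."""
    parsed = urlparse(url)
    # Assumes a host of the form <bucket>.s3.<region>.amazonaws.com
    bucket = parsed.netloc.split(".s3.", 1)[0]
    key = unquote(parsed.path.lstrip("/"))
    return bucket, key


def read_s3_bytes(bucket, key):
    """Fetch an object's bytes into memory; no local file is written."""
    import boto3  # third-party; imported lazily so the URL helper works alone
    return boto3.resource("s3").Object(bucket, key).get()["Body"].read()


def pdf_text(data):
    """Extract text from PDF bytes (assumes PyPDF2 2.x or later)."""
    import PyPDF2  # third-party
    # PyPDF2 accepts a file-like object, so wrap the bytes in BytesIO.
    reader = PyPDF2.PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(url):
    """Dispatch on file extension; only PDF is wired up in this sketch."""
    bucket, key = split_s3_url(url)
    ext = PurePosixPath(key).suffix.lower()
    if ext == ".pdf":
        return pdf_text(read_s3_bytes(bucket, key))
    raise NotImplementedError(f"no extractor wired up for {ext!r}")
```

The object is still transferred over the network, but it is held in memory rather than saved to disk. Other formats such as rtf, odt, or png would need their own extractor functions added to the dispatch.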
