How to use pdfminer to extract text from PDF files stored in an S3 bucket without downloading them locally?
I have a PDF stored in an S3 bucket, and I want to extract its text using pdfminer.
When the file is stored locally, I am able to extract the text with the code below:
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
from pdfminer.high_level import extract_pages
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import io
from urllib.parse import urlparse

resource_manager = PDFResourceManager()
file_handle = io.StringIO()
converter = TextConverter(resource_manager, file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

pdf_file = 'file.pdf'

with open(pdf_file, 'rb') as fh:
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = file_handle.getvalue()

# close open handles
converter.close()
file_handle.close()

total_no_pages = len(list(extract_pages(pdf_file)))
print(total_no_pages)
print(text)
This extracts the text cleanly.
However, I want to do the same for PDFs stored in S3.
I have made a connection to the S3 bucket and fetched the data like this:
import boto3, os

s3 = boto3.resource(
    service_name='s3',
    region_name=<region-name>,
    aws_access_key_id=<access-key>,
    aws_secret_access_key=<secret-key>
)

bucket_name = <bucket_name>
item_name = <folederName/file.pdf>

obj = s3.Object(bucket_name, item_name)
fs = obj.get()['Body'].read()
When I print fs, I see that it contains the data as bytes.
Kindly suggest a way to use pdfminer on PDFs stored in S3.
Instead of passing the local file handle:
    PDFPage.get_pages(fh, caching=True, check_extractable=True)
you could wrap the bytes you read from S3 in an in-memory binary stream:
    PDFPage.get_pages(io.BytesIO(fs), caching=True, check_extractable=True)
By the way, you are still downloading the object from S3; you are just not saving it to your local hard drive.
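Putting the question's code and this suggestion together, a minimal sketch might look like the following. It assumes the same pdfminer3 imports that the question uses; the function names (to_seekable_stream, extract_text_from_pdf_bytes, fetch_s3_object_bytes) are made up for illustration, and the S3 helper assumes credentials and region are picked up from the environment rather than passed explicitly. Pages are counted in the same loop, so the separate extract_pages call on a file path is not needed.

```python
import io


def to_seekable_stream(pdf_bytes: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable, file-like binary stream."""
    return io.BytesIO(pdf_bytes)


def extract_text_from_pdf_bytes(pdf_bytes: bytes):
    """Return (text, page_count) for a PDF held entirely in memory."""
    # Imported inside the function so this module loads even if pdfminer3
    # is not installed; these are the same imports the question uses.
    from pdfminer3.converter import TextConverter
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    output = io.StringIO()
    converter = TextConverter(resource_manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, converter)

    page_count = 0
    try:
        # The BytesIO stream takes the place of the open() file handle.
        stream = to_seekable_stream(pdf_bytes)
        for page in PDFPage.get_pages(stream,
                                      caching=True,
                                      check_extractable=True):
            interpreter.process_page(page)
            page_count += 1
        text = output.getvalue()
    finally:
        # Close open handles, as in the question's code.
        converter.close()
        output.close()
    return text, page_count


def fetch_s3_object_bytes(bucket_name: str, item_name: str) -> bytes:
    """Read an S3 object fully into memory (hypothetical helper)."""
    import boto3

    # Assumption: credentials and region come from the environment.
    s3 = boto3.resource("s3")
    return s3.Object(bucket_name, item_name).get()["Body"].read()
```

With these in place, something like text, total_no_pages = extract_text_from_pdf_bytes(fetch_s3_object_bytes(bucket_name, item_name)) would replace the local-file version. The whole object is held in memory, so this is best suited to PDFs that comfortably fit in RAM.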