Reading doc, docx files from s3 within lambda
TL;DR: I want to read doc and docx files that are stored on S3 from my AWS Lambda function.
On my local machine I just use textract.process(file_path) to read both doc and docx files.
So the intuitive way to do the same on Lambda is to download the file from S3 to the function's local storage (/tmp) and then process the file there, as I do on my local machine.
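For reference, the download-to-/tmp approach described above might look like the minimal sketch below. The helper name process_via_tmp and its extract parameter are hypothetical. s3_client is assumed to be a boto3.client('s3'), and extract is assumed to be a function that takes a local path, such as textract.process. The temporary file keeps the key's extension because textract picks its parser from the file extension.

```python
import os
import tempfile


def process_via_tmp(s3_client, bucket, key, extract):
    """Download s3://bucket/key to temporary storage, run extract(path), clean up.

    s3_client: assumed to be boto3.client('s3'); anything with download_file works.
    extract:   callable taking a local path, e.g. textract.process (assumption).
    """
    # Keep the original extension so extension-based parsers pick the right backend.
    suffix = os.path.splitext(key)[1]
    # On Lambda, tempfile.gettempdir() normally resolves to /tmp, the writable directory.
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        s3_client.download_file(bucket, key, local_path)
        return extract(local_path)
    finally:
        # /tmp is limited and can survive between warm invocations, so remove the file.
        os.remove(local_path)
```

The finally block matters on Lambda: a reused execution environment keeps whatever was left in /tmp, so files that are never deleted can eventually fill the available space.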
That's not cost-effective...
Is there a way to make a pipeline from the S3 object straight into some parser like textract that will just convert the doc / docx files into a readable object like a string?
My code so far, which reads files like txt:
import boto3

print('Loading function')

def lambda_handler(event, context):
    try:  # Read s3 file
        bucket_name = "appsresults"
        download_path = 'Folder1/file1.txt'
        filename = download_path
        s3 = boto3.resource('s3')
        content_object = s3.Object(bucket_name, filename)
        file_content = content_object.get()['Body'].read().decode('utf-8')
        print(file_content)
    except Exception as e:
        print("Couldn't read the file from s3 because:\n {0}".format(e))
    return event  # return event
This answer solves half of the problem.
textract.process currently doesn't support reading file-like objects. If it did, you could load the file from S3 directly into memory and pass it to the process function.
Older versions of textract internally used the python-docx package for reading .docx files. python-docx supports reading file-like objects. You can use the code below to achieve your goal, at least for .docx files.
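import io
import boto3
from docx import Document

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
obj = bucket.Object('/files/resume.docx')
# Download the object into an in-memory buffer instead of a file on disk
file_stream = io.BytesIO()
obj.download_fileobj(file_stream)
file_stream.seek(0)  # rewind so the buffer is read from the start
# Document is imported by name, so it is called without the docx. prefix
document = Document(file_stream)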
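A .docx file can be parsed from memory because it is a ZIP archive whose main text lives in word/document.xml. If packaging python-docx into the Lambda is inconvenient, a standard-library-only alternative can be sketched as below. This is a simplified illustration, not how python-docx or textract work internally: the function name docx_bytes_to_text is hypothetical, it ignores headers, footers and footnotes, and text inside nested text boxes may be counted twice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Return the paragraph text of a .docx given its raw bytes (no temp file)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # Each w:p element is a paragraph; its w:t descendants hold the visible text runs.
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

The bytes can come straight from the buffer filled by download_fileobj, for example docx_bytes_to_text(file_stream.getvalue()).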
If you're reading the docx file from S3, note that the Document() constructor is usually given a file path, but it also accepts a file-like object. So you can read the object's bytes, wrap them in io.BytesIO, and call the constructor like this.
import io
import boto3
from docx import Document

s3_client = boto3.client('s3')

def parseDocx(data):
    # Wrap the raw bytes in a file-like object that Document() can read
    stream = io.BytesIO(data)
    document = Document(docx=stream)
    # Join the paragraph texts with newlines so paragraph boundaries survive
    return '\n'.join(para.text for para in document.paragraphs)

Key = "acb.docx"
Bucket = "xyz"
obj_ = s3_client.get_object(Bucket=Bucket, Key=Key)
if Key.endswith('.docx'):
    fs = obj_['Body'].read()
    sentence = parseDocx(fs)
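Since the answers above only cover .docx, a handler that receives mixed file types needs to choose a parser per extension. One way to sketch that dispatch is shown below. The function name extract_text and the parsers mapping are hypothetical, and s3_client is assumed to be a boto3.client('s3'). Each parser is assumed to take the raw bytes and return text, as parseDocx does.

```python
import os


def extract_text(s3_client, bucket, key, parsers):
    """Fetch s3://bucket/key into memory and parse it according to its extension.

    s3_client: assumed to be boto3.client('s3').
    parsers:   dict mapping a lowercase extension to a callable that takes raw
               bytes and returns text, e.g. {'.docx': parseDocx}.
    """
    ext = os.path.splitext(key)[1].lower()
    if ext not in parsers:
        # Fail before downloading anything we cannot parse.
        raise ValueError("No parser registered for " + repr(ext))
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return parsers[ext](body)
```

With this shape, a '.doc' entry could wrap a path-based tool by writing the bytes to /tmp first, while '.docx' stays entirely in memory.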