Reading doc and docx files from S3 in Lambda

Question

TL;DR: I want to read doc and docx files that are stored on S3 from my AWS Lambda function.

On my local machine I just use textract.process(file_path) to read both doc and docx files.

So the intuitive way to do the same on Lambda is to download the file from S3 to the Lambda's local storage (tmp) and then process the tmp files like I do on my local machine.

That's not cost-effective...

Is there a way to build a pipeline from the S3 object straight into a parser like textract that converts the doc / docx files into a readable object such as a string?

My code so far, which reads files like txt:

import boto3

print('Loading function')


def lambda_handler(event, context):
    try:  # Read s3 file
        bucket_name = "appsresults"
        download_path = 'Folder1/file1.txt'
        filename = download_path
        s3 = boto3.resource('s3')
        content_object = s3.Object(bucket_name, filename)        

        file_content = content_object.get()['Body'].read().decode('utf-8')

        print(file_content)

    except Exception as e:
        print("Couldn't read the file from s3 because:\n {0}".format(e))

    return event  # return event

Answer 1

This answer solves half of the problem.

textract.process currently doesn't support reading file-like objects. If it did, you could load the file from S3 directly into memory and pass it to the process function.
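For the half this answer doesn't cover (.doc, or anything else that needs a path-based parser), one fallback is to spill the in-memory bytes to a short-lived temp file and delete it right after parsing. This is only a sketch: the helper name process_via_tempfile and its parameters are made up for illustration, and the parser is passed in as a callable so the snippet runs without textract installed.

```python
import os
import tempfile


def process_via_tempfile(data, suffix, parser):
    """Write in-memory bytes to a temp file, run a path-based parser, clean up.

    `parser` is any callable that takes a file path (for example
    textract.process). The helper itself is hypothetical.
    """
    # tempfile picks the platform temp dir, which is typically /tmp on Lambda
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return parser(path)
    finally:
        # Remove the file so it does not pile up across warm invocations
        os.remove(path)
```

Assuming body holds the bytes you read from S3, you would call it as process_via_tempfile(body, ".doc", textract.process). Passing the suffix keeps the original extension on the temp file, which matters if the parser picks its backend from the extension.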

Older versions of textract internally used the python-docx package for reading .docx files. python-docx supports reading file-like objects. You can use the code below to achieve your goal, at least for .docx files.

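import io

import boto3
from docx import Document

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
# Named s3_object rather than object to avoid shadowing the Python builtin
s3_object = bucket.Object('/files/resume.docx')

# Download the object into an in-memory buffer instead of a file on disk
file_stream = io.BytesIO()
s3_object.download_fileobj(file_stream)
file_stream.seek(0)  # rewind so the buffer is read from the start

# Document was imported directly, so call it without the docx. prefix
document = Document(file_stream)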

Answer 2

If you're reading the docx file from S3, note that the Document() constructor is usually given a path to the file. Instead, you can read the file as bytes, wrap them in a file-like object, and call the constructor like this.

import io

import boto3
from docx import Document

s3_client = boto3.client('s3')


def parseDocx(data):
    # Wrap the raw bytes in a file-like object that Document() can read
    stream = io.BytesIO(data)
    document = Document(docx=stream)
    content = ''
    # Paragraph texts are concatenated as-is, with no separator between them
    for para in document.paragraphs:
        content += para.text
    return content


Key = "acb.docx"
Bucket = "xyz"
obj_ = s3_client.get_object(Bucket=Bucket, Key=Key)
if Key.endswith('.docx'):
    fs = obj_['Body'].read()
    sentence = parseDocx(fs)
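If you'd rather not add python-docx to the Lambda deployment package, the same in-memory idea can be sketched with the standard library alone, since a .docx file is a zip archive with the body text in an XML part. This is a rough sketch, not a python-docx replacement: the function name docx_bytes_to_text is made up, it assumes the main part lives at word/document.xml (where Word conventionally writes it), and it ignores tabs, line breaks, headers and footers.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Return the paragraph text of a .docx given its raw bytes."""
    # Open the zip archive straight from memory, no temp file involved
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

With the snippet above you would call docx_bytes_to_text(fs) in place of parseDocx(fs). Unlike parseDocx, this version joins paragraphs with newlines.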
