
Can an AWS Lambda function work directly with files on S3, or is moving them to /tmp/ needed?

I am trying to write an AWS Lambda function in Python to collect a bunch of CSV files from an S3 bucket, concatenate them, drop the duplicates, and write the result back to S3. The files I want to read are stored under a prefix (in a "folder") on S3. Currently I am trying to read the files one by one using the following approach:

import csv
import boto3

s3 = boto3.client('s3')
keys = []
links = []

resp = s3.list_objects_v2(Bucket='mybucket')
# getting all objects in the bucket in a list
for obj in resp['Contents']:
    keys.append(obj['Key'])
# filtering those that are parsed entries
files = [k[6:] for k in keys if 'links/links' in k]
# reading into combined list
for file in files:
    with open(file, 'r') as csvfile:
        reader = csv.reader(csvfile)
        links = links + list(reader)

Currently I am getting the following error:

{
  "errorMessage": "[Errno 2] No such file or directory: 'links2020-02-27 14:59:49.933074.csv'",
  "errorType": "FileNotFoundError",
  "stackTrace": [
    "  File \"/var/task/handler.py\", line 21, in concatenatelinks\n    with open(file, 'r') as csvfile:\n"
  ]
}

In an earlier version, I didn't slice the filenames, which caused the same error. So do I need to load all the files into /tmp/ with something like s3.meta.client.upload_file('/tmp/' + str(filename), bucket, 'fusedlinks/' + str(filename)) to make them accessible to the Lambda function, or is there a more elegant solution to this?

From the error it seems that the filename convention is incorrect: links2020-02-27 14:59:49.933074.csv . You probably need to escape the whitespace while reading the file through the boto3 client.

But to read the file there are two options; I personally prefer Option 2 (though it depends on memory usage):

  1. The first option is to use the filesystem at /tmp.

You can refer to the sample example mentioned in the AWS Documentation.

Also, AWS Lambda provides a /tmp size of 512 MB at the moment, so you will need to find a different solution if the total size of all files is more than 512 MB. Refer to AWS Lambda Limits.
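A minimal sketch of option 1, using the bucket name and links/links prefix from the question (the helper names tmp_path_for and read_csvs_via_tmp are mine, not part of any API): each matching object is downloaded with the client's download_file into /tmp and then opened locally, which avoids the FileNotFoundError from calling open() on an S3 key.

```python
import csv
import os

def tmp_path_for(key):
    """Map an S3 key to a writable path under /tmp (the only
    writable directory in the Lambda execution environment)."""
    return os.path.join('/tmp', os.path.basename(key))

def read_csvs_via_tmp(bucket='mybucket', prefix='links/links'):
    """Download each matching CSV to /tmp, then read it with csv.reader."""
    import boto3  # deferred import so the pure helper above has no AWS dependency
    s3 = boto3.client('s3')
    rows = []
    for obj in s3.list_objects_v2(Bucket=bucket).get('Contents', []):
        key = obj['Key']
        if prefix not in key:
            continue
        local_path = tmp_path_for(key)
        s3.download_file(bucket, key, local_path)
        with open(local_path, 'r') as csvfile:
            rows.extend(csv.reader(csvfile))
    return rows
```

Note that list_objects_v2 returns at most 1000 keys per call; with more objects you would need a paginator.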

  2. The second option is to use an in-memory buffer. You can use Python's BytesIO. Example below:
    import boto3
    from io import BytesIO

    def load_from_s3(bucket, path):
        s3_resource = boto3.resource('s3')
        with BytesIO() as data:
            s3_resource.Bucket(bucket).download_fileobj(path, data)
            data.seek(0)
            # Do something with your data in the file
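Once a file is in an in-memory buffer, the concatenate-and-dedupe step from the question can be done in plain Python, with no filesystem involved. A sketch (the helper names rows_from_buffer and dedupe_rows are mine, for illustration only):

```python
import csv
from io import BytesIO, TextIOWrapper

def rows_from_buffer(data):
    """Parse CSV rows from a BytesIO positioned at the start."""
    return list(csv.reader(TextIOWrapper(data, encoding='utf-8')))

def dedupe_rows(rows):
    """Drop duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Example with an in-memory "file" plus extra rows that overlap it
buf = BytesIO(b"a,1\r\nb,2\r\n")
combined = rows_from_buffer(buf) + [['b', '2'], ['c', '3']]
print(dedupe_rows(combined))  # [['a', '1'], ['b', '2'], ['c', '3']]
```

In the real Lambda you would build combined by calling rows_from_buffer once per downloaded object, then write the deduplicated result back to S3 with put_object or upload_fileobj.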
