
What is a drop-in replacement for Python's `open()` function to read/write a file on S3?

What is a good way to replace Python's built-in open() function when working with Amazon S3 buckets in an AWS Lambda function?

Summary

  • I am looking for a method to download a file from or upload a file to Amazon S3 in an AWS Lambda function.
  • The syntax/API should be similar to Python's built-in open(), specifically returning a file-like object that can be passed to other functions like pandas.read_csv().
    • I am mostly interested in read() and write() and not so much seek() or tell(), which would be required for PIL.Image.open(), for example.
  • The method should use libraries already available in AWS Lambda, e.g. boto3.
  • It should keep the Lambda deployment size small, so not a large dependency like s3fs, which is usually overkill for an AWS Lambda.

Here is an example of what I am thinking of.

filename = "s3://mybucket/path/to/file.txt"
outpath = "s3://mybucket/path/to/lowercase.txt"

with s3_open(filename) as fd, s3_open(outpath, "wt") as fout:
    for line in fd:
        fout.write(line.strip().lower())

Motivation

Most people using Python are familiar with

filename = "/path/to/file.txt"
with open(filename) as fd:
    lines = fd.readlines()

Those using Amazon S3 are also probably familiar with S3 URIs, but S3 URIs are not convenient for working with boto3, the Amazon S3 Python SDK:

  • boto3 uses parameters like s3.get_object(Bucket=bucket, Key=key), whereas I usually have the S3 URI
  • boto3 returns a dict response, which contains a StreamingBody, and all I want is the StreamingBody
  • The StreamingBody returns bytes, but text is usually more convenient (see the sketch after this list)
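
For reference, a minimal sketch of what those plain boto3 calls look like (the bucket and key are placeholders, and the decoding step is left to the caller):

import boto3

s3_client = boto3.client("s3")

# get_object takes Bucket/Key parameters rather than an S3 URI
response = s3_client.get_object(Bucket="mybucket", Key="path/to/file.txt")

# the StreamingBody is buried in the response dict under "Body"
body = response["Body"]

# and it yields bytes, so text has to be decoded explicitly
text = body.read().decode("utf-8")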

Many Python libraries accept file-like objects, e.g. json, pandas, zipfile.

I often just need to download/upload a single file to S3, so there's no need to manage a whole file system. Nor do I need or want to save the file to disk only to read it back into memory.

A start

import io
from urllib.parse import urlparse

import boto3

session = boto3.Session()
s3_client = session.client("s3")


def s3uriparse(s3_uri):
    # Split an "s3://bucket/key" URI into (bucket, key)
    parsed = urlparse(s3_uri)
    return parsed.netloc, parsed.path.lstrip("/")


def s3_open(s3_uri, mode="rt"):
    bucket, key = s3uriparse(s3_uri)

    if mode.startswith("r"):
        # Read mode: the streaming body lives under the "Body" key of the response
        r = s3_client.get_object(Bucket=bucket, Key=key)
        fileobj = r["Body"]
        if mode.endswith("t"):
            # Wrap the raw byte stream so iteration yields text lines
            fileobj = io.TextIOWrapper(fileobj._raw_stream)
        return fileobj
    elif mode.startswith("w"):
        # Write mode
        raise NotImplementedError
    else:
        raise ValueError("Invalid mode")
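
One possible way to finish the write branch (a sketch only, based on my own assumption that buffering the whole object in memory and uploading it on close is acceptable; _S3WriteBuffer is a hypothetical helper name, reusing s3_client and s3uriparse from above):

class _S3WriteBuffer(io.BytesIO):
    # Hypothetical helper: an in-memory buffer that uploads its contents to S3 on close
    def __init__(self, bucket, key):
        super().__init__()
        self._bucket = bucket
        self._key = key

    def close(self):
        if not self.closed:
            s3_client.put_object(Bucket=self._bucket, Key=self._key, Body=self.getvalue())
        super().close()


def s3_open_write(s3_uri, mode="wt"):
    # Text mode wraps the byte buffer so write() accepts str; closing uploads the object
    bucket, key = s3uriparse(s3_uri)
    raw = _S3WriteBuffer(bucket, key)
    return io.TextIOWrapper(raw) if mode.endswith("t") else raw

Leaving a with block would then close the wrapper, which flushes the text, uploads the buffer with put_object, and closes it.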

There is a Python library called smart-open (on PyPI).

It's really good, because you can use all the file-handling commands you're familiar with, and it works with S3 objects. It can also read from compressed files.

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...    print(repr(line))
...    break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...    print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...    with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...        for line in fin:
...           fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'
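
smart_open can also write to S3 through the same open() call; here is a small sketch (the bucket and key are placeholders, and it assumes AWS credentials are already configured):

>>> # write straight to an S3 object (placeholder bucket/key)
>>> with open('s3://mybucket/path/to/lowercase.txt', 'w') as fout:
...     fout.write('hello from smart_open\n')
22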

I am confused about your motivation: what is wrong with

s3 = boto3.resource('s3')
s3_object = s3.Object(bucket_name, full_key)
s3_object.put(Body=byte_stream_of_some_kind)

for writing, and

s3 = boto3.client('s3')
s3_object_byte_stream = s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read()

for streaming the object into your lambda to update?

Both of these stream straight into or out of S3 - you don't have to download the object and then open it in a stream after the fact - and if you want, you can still use a with statement with either of them to close things up automatically (use it on the resource).
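
For example (my own illustration rather than something spelled out above), contextlib.closing gives you that with-statement behaviour around the StreamingBody returned by get_object:

import contextlib

import boto3

s3 = boto3.client('s3')

# read the object body inside a with statement so it is always closed afterwards
with contextlib.closing(s3.get_object(Bucket=bucket_name, Key=object_key)['Body']) as body:
    data = body.read()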

There is also no file system in S3. Although object keys use a file-system-like nomenclature (the '/' is displayed in the console as if there were directories), the actual internal layout of an S3 bucket is flat, with the key names simply following some parsing conventions. So if you know the full key of an object, you can stream it into and out of your lambda without any downloading at all.
