

How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?

I'm copying a file from S3 to Cloudfiles, and I would like to avoid writing the file to disk. The Python-Cloudfiles library has an object.stream() call that looks to be what I need, but I can't find an equivalent call in boto. I'm hoping to be able to do something like:

shutil.copyfileobj(s3Object.stream(), rsObject.stream())

Is this possible with boto (or, I suppose, any other S3 library)?

Other answers in this thread are related to boto, but S3.Object is not iterable anymore in boto3. So the following DOES NOT WORK; it produces a TypeError: 's3.Object' object is not iterable error message:

import io
import boto3

s3 = boto3.session.Session(profile_name=my_profile).resource('s3')
s3_obj = s3.Object(bucket_name=my_bucket, key=my_key)

with io.FileIO('sample.txt', 'w') as file:
    for i in s3_obj:  # raises TypeError: 's3.Object' object is not iterable
        file.write(i)

In boto3, the contents of the object are available at S3.Object.get()['Body'], which has been an iterable since version 1.9.68 but wasn't previously. Thus the following works for recent versions of boto3 but not earlier ones:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for i in body:  # each iteration yields a chunk of bytes
        file.write(i)

So, an alternative for older boto3 versions is to use the read method, but this loads the WHOLE S3 object into memory, which is not always an option when dealing with large files:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    # read() with no arguments pulls the entire object into memory at once
    file.write(body.read())

But the read method allows passing the amt parameter, which specifies the number of bytes we want to read from the underlying stream. This method can be called repeatedly until the whole stream has been read:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    # read(amt=512) returns b'' at end of stream, write() then returns 0 and the loop ends
    while file.write(body.read(amt=512)):
        pass

Digging into the botocore.response.StreamingBody code, one realizes that the underlying stream is also available, so we could iterate as follows (note, though, that _raw_stream is a private attribute and may change between versions):

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for b in body._raw_stream:
        file.write(b)

While googling I've also seen some links that could be useful, but I haven't tried them.

I figure at least some of the people seeing this question will be like me, and will want a way to stream a file from boto line by line (or comma by comma, or by any other delimiter). Here's a simple way to do that:

from boto.s3.connection import S3Connection

def getS3ResultsAsIterator(aws_access, bucket_name, prefix):
    s3_conn = S3Connection(**aws_access)
    bucket_obj = s3_conn.get_bucket(bucket_name)
    # go through the list of keys under the prefix
    for f in bucket_obj.list(prefix=prefix):
        unfinished_line = ''
        for chunk in f:  # iterating a boto Key yields chunks of data
            chunk = unfinished_line + chunk
            # split on whatever, or use a regex with re.split()
            lines = chunk.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                yield line
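
A hypothetical usage sketch (the credential values, bucket name and prefix below are placeholders; the dict keys follow boto's S3Connection keyword arguments):

aws_access = {
    'aws_access_key_id': 'MY_ACCESS_KEY',        # placeholder
    'aws_secret_access_key': 'MY_SECRET_KEY',    # placeholder
}
for line in getS3ResultsAsIterator(aws_access, 'my-bucket', 'logs/'):
    print(line)  # or parse each delimited line however you need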

@garnaat's answer above is still great and 100% true. Hopefully mine still helps someone out.

The Key object in boto, which represents an object in S3, can be used like an iterator, so you should be able to do something like this:

>>> import boto
>>> c = boto.connect_s3()
>>> bucket = c.lookup('garnaat_pub')
>>> key = bucket.lookup('Scan1.jpg')
>>> for bytes in key:
...     output_fp.write(bytes)  # output_fp: any writable stream (placeholder)

Or, as in the case of your example, you could do:

>>> shutil.copyfileobj(key, rsObject.stream())

Botocore's StreamingBody has an iter_lines() method:

https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.iter_lines

So:

import boto3
s3r = boto3.resource('s3')
iterator = s3r.Object(bucket, key).get()['Body'].iter_lines()

for line in iterator:
    print(line)
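
A small follow-up (my addition): iter_lines() yields raw bytes, so if the object is known to contain UTF-8 text you will usually want to decode each line:

# a fresh get() is needed here because the iterator above has already been consumed
for raw_line in s3r.Object(bucket, key).get()['Body'].iter_lines():
    print(raw_line.decode('utf-8'))  # assumes the object holds UTF-8 text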

This is my solution for wrapping the streaming body:

import io
import boto3

class S3ObjectInterator(io.RawIOBase):
    def __init__(self, bucket, key):
        """Initialize with S3 bucket and key names"""
        self.s3c = boto3.client('s3')
        self.obj_stream = self.s3c.get_object(Bucket=bucket, Key=key)['Body']

    def read(self, n=-1):
        """Read from the underlying StreamingBody"""
        return self.obj_stream.read() if n == -1 else self.obj_stream.read(n)

Example usage:

obj_stream = S3ObjectInterator(bucket, key)
for line in obj_stream:  # io.IOBase provides line iteration on top of read()
    print(line)
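
And since the wrapper only needs its read() method, it can also (my addition, a sketch rather than part of the original answer) be handed to shutil.copyfileobj to pump an S3 object into any writable file object without buffering it all in memory, which is essentially what the original question asked for; a local file stands in here for e.g. a Cloudfiles stream:

import shutil

src = S3ObjectInterator(bucket, key)  # bucket/key placeholders, as above
with open('sample.txt', 'wb') as dest:  # could equally be any writable file-like object
    shutil.copyfileobj(src, dest)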
