How can I use boto to stream a file from Amazon S3 to Rackspace Cloudfiles?

Question

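I'm copying a file from S3 to Cloudfiles, and I would like to avoid writing the file to disk. The Python-Cloudfiles library has an object.stream() call that looks like what I need, but I can't find an equivalent call in boto. I'm hoping to be able to do something like this: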

shutil.copyfileobj(s3Object.stream(),rsObject.stream())

Is this possible with boto (or, I suppose, any other S3 library)?

Answer 1

The other answers in this thread relate to boto, but S3.Object is no longer iterable in boto3. So the following does not work; it produces a TypeError: 's3.Object' object is not iterable error message:

import io
import boto3

s3 = boto3.session.Session(profile_name=my_profile).resource('s3')
s3_obj = s3.Object(bucket_name=my_bucket, key=my_key)

with io.FileIO('sample.txt', 'w') as file:
    for i in s3_obj:  # raises TypeError: 's3.Object' object is not iterable
        file.write(i)

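In boto3, the contents of the object are available at S3.Object.get()['Body'], which has been iterable since version 1.9.68 but previously wasn't. Thus the following works with recent versions of boto3, but not with earlier ones: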

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for i in body:
        file.write(i)

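So, an alternative for older boto3 versions is to use the read method, but this loads the whole S3 object into memory, which is not always an option when dealing with large files: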

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    # read() with no argument loads the whole object into memory at once
    file.write(body.read())

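But the read method accepts an amt parameter, which specifies the number of bytes to read from the underlying stream. This method can be called repeatedly until the whole stream has been read: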

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    while file.write(body.read(amt=512)):
        pass
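The same chunked-read loop can be written as a reusable iterator. This is only a minimal sketch: read_chunks is a hypothetical helper name, and io.BytesIO stands in for the StreamingBody so the snippet runs without AWS credentials. It assumes only that the stream's read(n) returns empty bytes at end of stream.

```python
import io
from functools import partial


def read_chunks(stream, chunk_size=512):
    """Return an iterator over successive chunks read from stream."""
    # iter(callable, sentinel) keeps calling stream.read(chunk_size)
    # and stops as soon as it returns the empty-bytes sentinel.
    return iter(partial(stream.read, chunk_size), b"")


# io.BytesIO stands in for s3_obj.get()['Body'] here.
body = io.BytesIO(b"abcdefg")
chunks = list(read_chunks(body, chunk_size=3))
print(chunks)
```

Each chunk can be written to the destination as soon as it arrives, so memory use stays bounded by chunk_size rather than by the object size.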

Digging deeper into the botocore.response.StreamingBody code, one finds that the underlying stream is also available, so we can iterate over it as follows:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for b in body._raw_stream:
        file.write(b)

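While googling, I also came across some links that may be useful, but I haven't tried them:

Wrapping the StreamingBody
Another related thread
An issue on the boto3 GitHub requesting that StreamingBody be a proper stream, which has been closed!!!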

Answer 2

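I figure at least some of the people seeing this question will be like me and will want a way to stream a file from boto line by line (or comma by comma, or by any other delimiter). Here's a simple way to do that: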

from boto.s3.connection import S3Connection

def getS3ResultsAsIterator(self, aws_access_info, key, prefix):
    s3_conn = S3Connection(**aws_access_info)
    # note: `key` holds the bucket name here
    bucket_obj = s3_conn.get_bucket(key)
    # go through the list of files under the prefix
    for f in bucket_obj.list(prefix=prefix):
        unfinished_line = ''
        for byte in f:
            byte = unfinished_line + byte
            #split on whatever, or use a regex with re.split()
            lines = byte.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                yield line
        # emit the last line if the file does not end with a newline
        if unfinished_line:
            yield unfinished_line

@garnaat's answer above is still great and 100% correct. Hopefully mine still helps someone.
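The buffering idea in this answer (keep the unfinished tail of each chunk and prepend it to the next one) does not depend on boto. Here is a minimal standalone sketch of it, assuming the input is any iterable of byte chunks; split_on_delimiter is a hypothetical name, and the list of chunks stands in for iterating over a key.

```python
def split_on_delimiter(chunks, delimiter=b"\n"):
    """Yield delimiter-separated records from an iterable of byte chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last delimiter is complete; the tail may not be.
        *complete, pending = pending.split(delimiter)
        yield from complete
    # Emit whatever is left if the stream ends without a trailing delimiter.
    if pending:
        yield pending


# A record ("bc") that straddles two chunks is reassembled correctly.
records = list(split_on_delimiter([b"a,b", b"c,", b"d"], delimiter=b","))
print(records)
```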

Answer 3

The Key object in boto, which represents an object in S3, can be used like an iterator, so you should be able to do something like this:

>>> import boto
>>> c = boto.connect_s3()
>>> bucket = c.lookup('garnaat_pub')
>>> key = bucket.lookup('Scan1.jpg')
>>> for bytes in key:
...     output_stream.write(bytes)  # output_stream: any writable file-like object

Or, as in your example, you could do:

>>> shutil.copyfileobj(key, rsObject.stream())
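shutil.copyfileobj only calls read() on the source and write() on the destination, so any pair of objects with those methods can be connected without touching disk. The sketch below illustrates that with stand-ins: io.BytesIO plays the role of the S3 key, and CountingWriter is a hypothetical destination (not part of boto or Cloudfiles) that merely records what it receives.

```python
import io
import shutil


class CountingWriter:
    """Hypothetical destination: records how many bytes were written to it."""

    def __init__(self):
        self.total = 0
        self.calls = 0

    def write(self, data):
        self.total += len(data)
        self.calls += 1
        return len(data)


# Stand-in for the S3 key; any object with read(size) behaves the same way.
source = io.BytesIO(b"x" * 10_000)
sink = CountingWriter()

# Copy 4 KiB at a time, so at most 4 KiB is buffered per iteration.
shutil.copyfileobj(source, sink, length=4096)
print(sink.total, sink.calls)
```

Whether rsObject.stream() returns an object with a compatible write() method is an assumption carried over from the question; it is not verified here.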

Answer 4

Botocore's StreamingBody has an iter_lines() method:

https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.iter_lines

So:

import boto3
s3r = boto3.resource('s3')
iterator = s3r.Object(bucket, key).get()['Body'].iter_lines()

for line in iterator:
    print(line)

Answer 5

Here is my solution for wrapping the StreamingBody:

import io
import boto3
class S3ObjectInterator(io.RawIOBase):
    def __init__(self, bucket, key):
        """Initialize with S3 bucket and key names"""
        self.s3c = boto3.client('s3')
        self.obj_stream = self.s3c.get_object(Bucket=bucket, Key=key)['Body']

    def read(self, n=-1):
        """Read from the stream"""
        return self.obj_stream.read() if n == -1 else self.obj_stream.read(n)

Usage example:

obj_stream = S3ObjectInterator(bucket, key)
for line in obj_stream:
    print(line)
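A variant of the same wrapping idea implements readinto() and readable(), which lets the standard io.BufferedReader and io.TextIOWrapper classes be layered on top for buffered, decoded line iteration. This is a sketch under assumptions: RawStreamWrapper is a hypothetical name, and io.BytesIO stands in for the S3 response body, so only a read(n) method on the source is relied upon.

```python
import io


class RawStreamWrapper(io.RawIOBase):
    """Hypothetical adapter exposing any object with read(n) as a raw stream."""

    def __init__(self, source):
        self._source = source

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fill the caller's buffer with up to len(buffer) bytes; 0 means EOF.
        data = self._source.read(len(buffer))
        buffer[: len(data)] = data
        return len(data)


# io.BytesIO stands in for the S3 response body.
raw = RawStreamWrapper(io.BytesIO("first\nsecond\n".encode("utf-8")))
text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8")
lines = [line.rstrip("\n") for line in text]
print(lines)
```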

5 answers

Answer 1 | 72 | 2016-11-17 17:32:35
Answer 2 | 21 | 2013-06-03 04:29:35
Answer 3 | 21 | accepted | 2011-10-02 07:54:34
Answer 4 | 10 | 2018-08-31 19:28:23
Answer 5 | 7 | 2016-11-28 22:26:10
