
How to generate and write stream data to S3 on the fly with Python and boto3?

How can I write dynamically generated data to S3 on the fly, in chunks, with Python and boto3?

I want to realise something like this:

from io import BytesIO
from boto3 import ???

s3_opened_stream = ???

for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
  data = (i*1000).encode('utf-8')  # encode to bytes, since BytesIO expects bytes
  s3_opened_stream.append_chunk(BytesIO(data))

# OR something like

with ??? as s3_opened_stream:
  for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
    data = (i*1000).encode('utf-8')  # encode to bytes, since BytesIO expects bytes
    s3_opened_stream.append_chunk(BytesIO(data))

And I expect to see a resulting file like:

aaaaaa......
bbbbbb......
cccccc......
.....

Where every line is appended to the same S3 object.

I checked examples on the internet, and in every one of them the data was generated in full first and only then uploaded to S3.

I tried to use those examples like this:

from io import BytesIO
from boto3.s3.transfer import TransferConfig
from boto3 import resource

config = TransferConfig(
    # set possible lower size to force multipart-upload in any case
    multipart_threshold=1, 
    max_concurrency=1,
    multipart_chunksize=5242880,
    use_threads=False
)

bucket = resource(
    service_name='s3',
    region_name=params['region_name'],
    endpoint_url=params['endpoint_url'],
    aws_access_key_id=params['aws_access_key_id'],
    aws_secret_access_key=params['aws_secret_access_key']
).Bucket(params['bucket_name'])

with BytesIO() as one_chunk:
    for line in lines:
        # write new line inside one_chunk
        ...

        # write data to object
        bucket.upload_fileobj(one_chunk, obj_path, Config=config, Callback=None)

        # clear chunk data to release RAM
        one_chunk.truncate(0)

But upload_fileobj rewrites the object with the new line every time instead of appending to it.

In other words, I want to open an S3 object in append mode (like with open('path', mode='a') ) and append lines generated in a loop, because the actual resulting file is very big and can't be fully stored in RAM.
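For reference, S3 has no real append mode; the closest thing in plain boto3 is driving the low-level multipart-upload API by hand, where each generated chunk becomes one part (every part except the last must be at least 5 MB). Below is a minimal sketch of that idea, with placeholder bucket/key names and credentials taken from the default configuration:

from io import BytesIO
from boto3 import client

s3 = client('s3')                       # credentials/endpoint as in your setup
bucket, key = 'bucket_name', 'out.txt'  # placeholder names

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, part_number, buffer = [], 1, BytesIO()

def flush(buf, n):
    # upload the buffered bytes as part number n and remember its ETag
    buf.seek(0)
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=n,
                          UploadId=mpu['UploadId'], Body=buf)
    parts.append({'PartNumber': n, 'ETag': resp['ETag']})

for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
    buffer.write((i * 1000 + '\n').encode('utf-8'))
    if buffer.tell() >= 5 * 1024 * 1024:   # 5 MB minimum part size
        flush(buffer, part_number)
        part_number += 1
        buffer = BytesIO()                 # release the uploaded chunk

if buffer.tell():                          # the last part may be smaller than 5 MB
    flush(buffer, part_number)

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})

This keeps only the current chunk in RAM, but you have to track part numbers, ETags and the 5 MB limit yourself, which is exactly the bookkeeping that higher-level tools hide.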

In the end I gave up trying to understand the boto3 code. It is pretty complicated, and its classes are not easily extendable.

It looks like smart_open is the easiest solution:

I checked this code with a ~4GB input file:

from boto3 import Session
from smart_open import open

c = Session(
    aws_access_key_id=id,
    aws_secret_access_key=key
).client('s3', endpoint_url='http://minio.local:9000')  # I use minio for testing

read_path="bucket_name/in.csv"
write_path="bucket_name/out.csv"
with open(f"s3://{read_path}", mode='rb', transport_params={'client': c}) as fr:
    with open(f"s3://{write_path}", mode='wb', transport_params={'client': c}) as fw:
        for line in fr:
            fw.write(line)

And it works like a charm. Memory usage was about 350MB at peak (checked via htop's RES value).

RES: How much physical RAM the process is using, measured in kilobytes.

RES stands for the resident size, which is an accurate representation of how much actual physical memory a process is consuming. (This also corresponds directly to the %MEM column.)
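For the original goal (writing generated chunks rather than copying an existing file), the same approach should work with only the write side. A short sketch under the same connection setup, with a placeholder output key; smart_open buffers the writes and uploads them in parts, so only the current buffer is held in RAM:

from boto3 import Session
from smart_open import open

c = Session(
    aws_access_key_id=id,
    aws_secret_access_key=key
).client('s3', endpoint_url='http://minio.local:9000')

write_path = "bucket_name/generated.txt"  # placeholder key
with open(f"s3://{write_path}", mode='w', transport_params={'client': c}) as fw:
    for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
        fw.write(i * 1000 + '\n')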
