
How to generate and write stream data to S3 on the fly with Python and boto3?

How can I write dynamically generated data to S3 on the fly, in chunks, with Python and boto3?

I want to realise something like this:

from io import BytesIO
from boto3 import ???

s3_opened_stream = ???

for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
  data = (i*1000).encode('utf-8')  # encode to bytes, since BytesIO expects bytes
  s3_opened_stream.append_chunk(BytesIO(data))

# OR something like

with ??? as s3_opened_stream:
  for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
    data = (i*1000).encode('utf-8')  # encode to bytes, since BytesIO expects bytes
    s3_opened_stream.append_chunk(BytesIO(data))

And I expect to see a resulting file like:

aaaaaa......
bbbbbb......
cccccc......
.....

Where every line is appended to the same S3 object.

I checked examples on the internet, and in every one of them the data was generated in full first and only then uploaded to S3.

I tried to use those examples like this:

from io import BytesIO
from boto3.s3.transfer import TransferConfig
from boto3 import resource

config = TransferConfig(
    # set possible lower size to force multipart-upload in any case
    multipart_threshold=1, 
    max_concurrency=1,
    multipart_chunksize=5242880,
    use_threads=False
)

bucket = resource(
    service_name='s3',
    region_name=params['region_name'],
    endpoint_url=params['endpoint_url'],
    aws_access_key_id=params['aws_access_key_id'],
    aws_secret_access_key=params['aws_secret_access_key']
).Bucket(params['bucket_name'])

with BytesIO() as one_chunk:
    for line in lines:
        # write new line inside one_chunk
        ...

        # write data to object
        bucket.upload_fileobj(one_chunk, obj_path, Config=config, Callback=None)

        # clear chunk data to release RAM
        one_chunk.truncate(0)

But upload_fileobj rewrites the object with the new line every time instead of appending to it.

In other words, I want to open an S3 object in append mode (like with open('path', mode='a') ) and append lines generated in a loop, because the actual resulting file is very big and can't be fully stored in RAM.
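For reference, S3 has no real append mode; the closest thing in plain boto3 is driving the low-level multipart-upload API by hand, where each generated chunk becomes one part (every part except the last must be at least 5 MB). Below is a minimal sketch of that idea, with placeholder bucket/key names and credentials taken from the default configuration:

from io import BytesIO
from boto3 import client

s3 = client('s3')                       # credentials/endpoint as in your setup
bucket, key = 'bucket_name', 'out.txt'  # placeholder names

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, part_number, buffer = [], 1, BytesIO()

def flush(buf, n):
    # upload the buffered bytes as part number n and remember its ETag
    buf.seek(0)
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=n,
                          UploadId=mpu['UploadId'], Body=buf)
    parts.append({'PartNumber': n, 'ETag': resp['ETag']})

for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
    buffer.write((i * 1000 + '\n').encode('utf-8'))
    if buffer.tell() >= 5 * 1024 * 1024:   # 5 MB minimum part size
        flush(buffer, part_number)
        part_number += 1
        buffer = BytesIO()                 # release the uploaded chunk

if buffer.tell():                          # the last part may be smaller than 5 MB
    flush(buffer, part_number)

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})

This keeps only the current chunk in RAM, but you have to track part numbers, ETags and the 5 MB limit yourself, which is exactly the bookkeeping that higher-level tools hide.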

In the end I gave up trying to understand the boto3 code. It is pretty complicated, and its classes are not easily extendable.

It looks like smart_open is the easiest solution:

I checked this code with a ~4GB input file:

from boto3 import Session
from smart_open import open

c = Session(
    aws_access_key_id=id,
    aws_secret_access_key=key
).client('s3', endpoint_url='http://minio.local:9000')  # I use minio for testing

read_path="bucket_name/in.csv"
write_path="bucket_name/out.csv"
with open(f"s3://{read_path}", mode='rb', transport_params={'client': c}) as fr:
    with open(f"s3://{write_path}", mode='wb', transport_params={'client': c}) as fw:
        for line in fr:
            fw.write(line)

And it works like a charm. Memory usage was about 350MB at peak (checked via htop's RES value).

RES: How much physical RAM the process is using, measured in kilobytes.

RES stands for the resident size, which is an accurate representation of how much actual physical memory a process is consuming. (This also corresponds directly to the %MEM column.)
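For the original goal (writing generated chunks rather than copying an existing file), the same approach should work with only the write side. A short sketch under the same connection setup, with a placeholder output key; smart_open buffers the writes and uploads them in parts, so only the current buffer is held in RAM:

from boto3 import Session
from smart_open import open

c = Session(
    aws_access_key_id=id,
    aws_secret_access_key=key
).client('s3', endpoint_url='http://minio.local:9000')

write_path = "bucket_name/generated.txt"  # placeholder key
with open(f"s3://{write_path}", mode='w', transport_params={'client': c}) as fw:
    for i in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'x', 'y', 'z']:
        fw.write(i * 1000 + '\n')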
