
How to stream a large gzipped .tsv file from s3, process it, and write back to a new file on s3?

I have a large file s3://my-bucket/in.tsv.gz that I would like to load and process, then write its processed version back to an s3 output file s3://my-bucket/out.tsv.gz.

  1. How do I stream in.tsv.gz directly from s3 without loading the whole file into memory (it does not fit in memory)?
  2. How do I write the processed gzipped stream directly to s3?

In the following code, I show how I was thinking of loading the gzipped input dataframe from s3, and how I would write the .tsv if it were stored locally (bucket_dir_local = './').

import pandas as pd
import s3fs  # used by pandas under the hood for s3:// paths
import os
import gzip
import csv
import io

bucket_dir = 's3://my-bucket/annotations/'
# this loads the whole dataframe into memory, which is what I want to avoid
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")

bucket_dir_local = './'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
        # delimiter='\t' keeps the output tab-separated, matching the .tsv name
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'],
                           delimiter='\t', extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            my_dict = {"test": index, "testing": row.iloc[6]}  # 7th column by position
            w.writerow(my_dict)
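
A minimal sketch of writing the gzipped TSV directly to an s3 path, using s3fs (imported above) to supply the writable file object and wrapping it in gzip, might look like the following; the bucket path, credentials handling, and row values are placeholders, not tested code:

import csv
import gzip

import s3fs

fs = s3fs.S3FileSystem()  # credentials resolved from the environment

# fs.open returns a file-like object that is uploaded to s3 as it is written
with fs.open('my-bucket/annotations/out.tsv.gz', 'wb') as raw:
    with gzip.open(raw, 'wt', encoding='utf-8', newline='') as wrapper:
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'],
                           delimiter='\t', extrasaction='ignore')
        w.writeheader()
        w.writerow({'test': 0, 'testing': 'example'})  # placeholder row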

Edit: smart_open looks like the way to go.

For downloading the file, you can stream the S3 object directly in python. I'd recommend reading that entire post, but here are some key lines from it:

import boto3
import gzip

s3 = boto3.client('s3', aws_access_key_id='mykey', aws_secret_access_key='mysecret') # your authentication may vary
obj = s3.get_object(Bucket='my-bucket', Key='my/precious/object')

# obj['Body'] is a streaming file-like object; gzip decompresses it on the fly
body = obj['Body']

with gzip.open(body, 'rt') as gf:
    for ln in gf:
        process(ln)  # replace with your own line-level processing

Unfortunately S3 doesn't support true streaming input, but this SO answer has an implementation that chunks out the file and sends each chunk up to S3. While not a "true stream", it will let you upload large files without needing to keep the entire thing in memory.
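
The linked answer isn't reproduced here, but a rough sketch of that chunked idea using boto3's multipart-upload API could look like the following; the bucket name, key, local source file, and 5 MB part size are placeholders:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'out.tsv.gz'
part_size = 5 * 1024 * 1024  # S3 requires >= 5 MB for every part except the last

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
try:
    with open('out.tsv.gz', 'rb') as src:  # any source of byte chunks works here
        part_number = 1
        while True:
            chunk = src.read(part_size)
            if not chunk:
                break
            resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                                  UploadId=mpu['UploadId'], Body=chunk)
            parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
            part_number += 1
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})
except Exception:
    # abandon the upload so the partial parts don't linger (and incur charges)
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'])
    raise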

Here is a dummy example to read a file from s3 and write it back to s3 using smart_open:

from smart_open import open
import os

bucket_dir = "s3://my-bucket/annotations/"

# smart_open infers gzip (de)compression from the .gz extension,
# so fin yields decompressed bytes and fout re-compresses on write
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(os.path.join(bucket_dir, "out.tsv.gz"), "wb") as fout:
        for line in fin:
            fields = [i.strip() for i in line.decode().split("\t")]
            fout.write(("\t".join(fields) + "\n").encode())
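
To tie this back to the csv.DictWriter processing in the question, a sketch combining smart_open's text mode with the same field mapping might look like the following; the paths, field names, and column index 6 simply mirror the question and are illustrative:

import csv
from smart_open import open

# text mode "r"/"w" plus the .gz suffix gives transparent (de)compression
with open("s3://my-bucket/annotations/in.tsv.gz", "r") as fin, \
     open("s3://my-bucket/annotations/out.tsv.gz", "w") as fout:
    w = csv.DictWriter(fout, fieldnames=["test", "testing"],
                       delimiter="\t", extrasaction="ignore")
    w.writeheader()
    for index, line in enumerate(fin):
        fields = line.rstrip("\n").split("\t")
        # column 6 mirrors row.iloc[6] from the question's dataframe loop
        w.writerow({"test": index, "testing": fields[6]})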
