Best way to update first row of large CSV files in Amazon S3
I have 15 large files, each over 5 GB. The header is missing from these 15 large CSV files, and we need to inject it as the first row in each of the files. What is the most intelligent way to do this?

Currently, I have an S3 cp command running sed over the file, but it's slow and time consuming. Is there a better approach?

The data is gzipped.
I suppose if you do not save the file to disk, which aws s3 cp does, you could speed things up. (Though perhaps you are using a shell process substitution to avoid saving to disk.)
If you are open to using the AWS Python SDK, boto3, you could stream the response. But if you want to avoid loading the entire file into memory, you will need to use a multipart upload, which is kind of a pain to manage.
This question indicates you could concatenate your header with the file without decompressing the large file, which could speed things up.
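That works because the gzip format allows multiple compressed members to be concatenated into one file, and readers such as Python's gzip module decompress the result as a single stream. A quick sketch with made-up sample rows to illustrate the idea:

import gzip
import io

# Compress the header and a couple of sample rows as separate gzip members.
header_gz = gzip.compress(b"Name,Date,Score\n")
body_gz = gzip.compress(b"Alice,2024-01-01,10\nBob,2024-01-02,7\n")

# Concatenating the raw bytes yields a valid multi-member gzip stream.
combined = header_gz + body_gz

with gzip.GzipFile(fileobj=io.BytesIO(combined)) as f:
    print(f.read().decode("utf-8"))
# prints the header followed by the two sample rows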
Putting those two ideas together, here is an example.
import boto3
import gzip

s3 = boto3.client("s3")

bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Compress the header on its own (note the trailing newline so the first
# data row starts on its own line). The compressed header can simply be
# prepended to the existing compressed data.
my_header = "Name,Date,Score\n".encode("utf-8")
header_compressed = gzip.compress(my_header)

# Stream the existing object and collect its chunks after the header.
r = s3.get_object(Bucket=bucket, Key=key)
output = [header_compressed]
for chunk in r["Body"].iter_chunks():
    output.append(chunk)

# Note: this still buffers the whole object in memory before uploading.
s3.put_object(Bucket=bucket, Key=new_key, Body=b"".join(output))
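One caveat with the example above: put_object holds the entire object in memory, and a single PUT upload is capped at 5 GB, so for the largest files you may need the multipart upload mentioned earlier. Below is a rough, untested sketch of that route, using the same placeholder bucket and key names as above; S3 requires every part except the last to be at least 5 MB, so chunks are buffered into roughly 8 MB parts.

import boto3
import gzip

s3 = boto3.client("s3")

bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"
part_size = 8 * 1024 * 1024  # every part except the last must be >= 5 MB

header_compressed = gzip.compress(b"Name,Date,Score\n")

mpu = s3.create_multipart_upload(Bucket=bucket, Key=new_key)
upload_id = mpu["UploadId"]
parts = []

def upload_part(data, part_number):
    resp = s3.upload_part(
        Bucket=bucket,
        Key=new_key,
        UploadId=upload_id,
        PartNumber=part_number,
        Body=data,
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

try:
    r = s3.get_object(Bucket=bucket, Key=key)
    buffer = header_compressed  # the first part starts with the compressed header
    part_number = 1
    for chunk in r["Body"].iter_chunks(chunk_size=1024 * 1024):
        buffer += chunk
        if len(buffer) >= part_size:
            upload_part(buffer, part_number)
            part_number += 1
            buffer = b""
    if buffer:  # the final part is allowed to be smaller than 5 MB
        upload_part(buffer, part_number)
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=new_key,
        UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # Abort so incomplete parts are not left behind (and billed).
    s3.abort_multipart_upload(Bucket=bucket, Key=new_key, UploadId=upload_id)
    raise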