Best way to update first row of large CSV files in Amazon S3
I have 15 large files, each over 5 GB. The header is missing from these 15 large CSV files, and we need to inject it as the first row in each of the files. What is the most intelligent way to do this?

Currently, I have an S3 cp command running sed over the file, but it's slow and time consuming. Is there a better approach?

The data is gzipped.
I suppose if you do not save the file to disk, which aws s3 cp does, you could speed things up. (Though perhaps you are using a shell process substitution to avoid saving to disk.)
If you are open to using the AWS Python SDK, boto3, you could stream the response. But if you want to avoid loading the entire file into memory, you will need to use a multipart upload, which is kind of a pain to manage.
This question indicates you could concatenate your header with the file without decompressing the large file, which could speed things up.
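That works because the gzip format allows multiple compressed members to be concatenated into one file, and readers such as Python's gzip module decompress the result as a single stream. A quick sketch with made-up sample rows to illustrate the idea:

import gzip
import io

# Compress the header and a couple of sample rows as separate gzip members.
header_gz = gzip.compress(b"Name,Date,Score\n")
body_gz = gzip.compress(b"Alice,2024-01-01,10\nBob,2024-01-02,7\n")

# Concatenating the raw bytes yields a valid multi-member gzip stream.
combined = header_gz + body_gz

with gzip.GzipFile(fileobj=io.BytesIO(combined)) as f:
    print(f.read().decode("utf-8"))
# prints the header followed by the two sample rows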
Putting those two ideas together, here is an example.
import boto3
import gzip

s3 = boto3.client("s3")

bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Compress the header on its own (note the trailing newline so the first
# data row starts on its own line). The compressed header can simply be
# prepended to the existing compressed data.
my_header = "Name,Date,Score\n".encode("utf-8")
header_compressed = gzip.compress(my_header)

# Stream the existing object and collect its chunks after the header.
r = s3.get_object(Bucket=bucket, Key=key)
output = [header_compressed]
for chunk in r["Body"].iter_chunks():
    output.append(chunk)

# Note: this still buffers the whole object in memory before uploading.
s3.put_object(Bucket=bucket, Key=new_key, Body=b"".join(output))
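One caveat with the example above: put_object holds the entire object in memory, and a single PUT upload is capped at 5 GB, so for the largest files you may need the multipart upload mentioned earlier. Below is a rough, untested sketch of that route, using the same placeholder bucket and key names as above; S3 requires every part except the last to be at least 5 MB, so chunks are buffered into roughly 8 MB parts.

import boto3
import gzip

s3 = boto3.client("s3")

bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"
part_size = 8 * 1024 * 1024  # every part except the last must be >= 5 MB

header_compressed = gzip.compress(b"Name,Date,Score\n")

mpu = s3.create_multipart_upload(Bucket=bucket, Key=new_key)
upload_id = mpu["UploadId"]
parts = []

def upload_part(data, part_number):
    resp = s3.upload_part(
        Bucket=bucket,
        Key=new_key,
        UploadId=upload_id,
        PartNumber=part_number,
        Body=data,
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

try:
    r = s3.get_object(Bucket=bucket, Key=key)
    buffer = header_compressed  # the first part starts with the compressed header
    part_number = 1
    for chunk in r["Body"].iter_chunks(chunk_size=1024 * 1024):
        buffer += chunk
        if len(buffer) >= part_size:
            upload_part(buffer, part_number)
            part_number += 1
            buffer = b""
    if buffer:  # the final part is allowed to be smaller than 5 MB
        upload_part(buffer, part_number)
    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=new_key,
        UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # Abort so incomplete parts are not left behind (and billed).
    s3.abort_multipart_upload(Bucket=bucket, Key=new_key, UploadId=upload_id)
    raise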