
Best way to update first row of large CSV files in Amazon S3

I have 15 large CSV files, each over 5 GB. The header row is missing from all of them, and I need to inject it as the first row of each file. What is the most intelligent way to do this?

Currently, I have an aws s3 cp command that pipes through sed into the file, but it is slow and time-consuming. Is there a better approach? The data is gzipped.

I suppose you could speed things up by not saving the file to disk, which aws s3 cp does. (Though perhaps you are already using shell process substitution to avoid saving to disk.)

If you are open to using the AWS Python SDK, boto3, you could stream the response. But if you want to avoid loading the entire file into memory, you will need to use a multipart upload, which is kind of a pain to manage.

This question indicates that you can concatenate your compressed header with the compressed file without decompressing the large file, which could speed things up.
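That works because the gzip format allows independently compressed members to be concatenated into one stream, and standard decompressors (gunzip, Python's gzip module) read the result as a single file. Here is a minimal local sketch of that idea (the data row is made up):

import gzip

# Pretend this is the already-gzipped, header-less CSV body stored in S3.
body_compressed = gzip.compress(b"Alice,2020-01-01,95\n")

# Compress just the header as its own gzip member (note the trailing newline).
header_compressed = gzip.compress(b"Name,Date,Score\n")

# Concatenating the two members yields a valid gzip stream.
combined = header_compressed + body_compressed
print(gzip.decompress(combined).decode("utf-8"))
# Name,Date,Score
# Alice,2020-01-01,95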

Putting those two ideas together, here is an example.

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Compress the header as its own gzip member; the trailing newline keeps
# it on its own row once the stream is decompressed.
my_header = "Name,Date,Score\n".encode("utf-8")
header_compressed = gzip.compress(my_header)

# Stream the existing object and prepend the compressed header.
# Note: this still buffers the whole object in memory before uploading.
r = s3.get_object(Bucket=bucket, Key=key)
output = [header_compressed]
for chunk in r["Body"].iter_chunks():
    output.append(chunk)

s3.put_object(Bucket=bucket, Key=new_key, Body=b"".join(output))
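
If buffering a whole 5 GB+ object in memory is not acceptable, the multipart upload mentioned above can do most of the work server-side with upload_part_copy. Below is a rough sketch of that approach, reusing the same (made-up) bucket and key names. Note that every part except the last must be at least 5 MiB, so the tiny compressed header is bundled with the first chunk of the original object; error handling (e.g. aborting the multipart upload on failure) is left out.

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Header as its own gzip member, newline included so it stays on its own row.
header_compressed = gzip.compress("Name,Date,Score\n".encode("utf-8"))

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

mpu = s3.create_multipart_upload(Bucket=bucket, Key=new_key)
upload_id = mpu["UploadId"]
parts = []

# Part 1: the compressed header plus the first chunk of the original object,
# since the header alone would be far below the 5 MiB minimum part size.
first_chunk_size = 8 * 1024 * 1024  # 8 MiB; an arbitrary choice
first_chunk = s3.get_object(
    Bucket=bucket, Key=key, Range=f"bytes=0-{first_chunk_size - 1}"
)["Body"].read()
resp = s3.upload_part(
    Bucket=bucket, Key=new_key, UploadId=upload_id, PartNumber=1,
    Body=header_compressed + first_chunk,
)
parts.append({"PartNumber": 1, "ETag": resp["ETag"]})

# Copy the rest of the original object server-side in large ranges;
# these bytes never pass through the machine running this script.
part_size = 512 * 1024 * 1024  # 512 MiB per copied part
part_number = 2
offset = first_chunk_size
while offset < size:
    end = min(offset + part_size, size) - 1
    resp = s3.upload_part_copy(
        Bucket=bucket, Key=new_key, UploadId=upload_id, PartNumber=part_number,
        CopySource={"Bucket": bucket, "Key": key},
        CopySourceRange=f"bytes={offset}-{end}",
    )
    parts.append({"PartNumber": part_number, "ETag": resp["CopyPartResult"]["ETag"]})
    part_number += 1
    offset = end + 1

s3.complete_multipart_upload(
    Bucket=bucket, Key=new_key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)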
