
Best way to update first row of large CSV files in Amazon S3

I have 15 large CSV files, each over 5 GB. The header row is missing from all of them, and I need to inject it as the first row of each file. What is the most intelligent way to do this?

Currently, I have an aws s3 cp command that pipes through sed into the file, but it is slow and time-consuming. Is there a better approach? The data is gzipped.

I suppose you could speed things up by not saving the file to disk, which aws s3 cp does. (Though perhaps you are already using shell process substitution to avoid saving to disk.)

If you are open to using the AWS Python SDK, boto3, you could stream the response. But if you want to avoid loading the entire file into memory, you will need to use a multipart upload, which is kind of a pain to manage.

This question indicates that you can concatenate your compressed header with the compressed file without decompressing the large file, which could speed things up.
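That works because the gzip format allows independently compressed members to be concatenated into one stream, and standard decompressors (gunzip, Python's gzip module) read the result as a single file. Here is a minimal local sketch of that idea (the data row is made up):

import gzip

# Pretend this is the already-gzipped, header-less CSV body stored in S3.
body_compressed = gzip.compress(b"Alice,2020-01-01,95\n")

# Compress just the header as its own gzip member (note the trailing newline).
header_compressed = gzip.compress(b"Name,Date,Score\n")

# Concatenating the two members yields a valid gzip stream.
combined = header_compressed + body_compressed
print(gzip.decompress(combined).decode("utf-8"))
# Name,Date,Score
# Alice,2020-01-01,95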

Putting those two ideas together, here is an example.

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Compress the header as its own gzip member; the trailing newline keeps
# it on its own row once the stream is decompressed.
my_header = "Name,Date,Score\n".encode("utf-8")
header_compressed = gzip.compress(my_header)

# Stream the existing object and prepend the compressed header.
# Note: this still buffers the whole object in memory before uploading.
r = s3.get_object(Bucket=bucket, Key=key)
output = [header_compressed]
for chunk in r["Body"].iter_chunks():
    output.append(chunk)

s3.put_object(Bucket=bucket, Key=new_key, Body=b"".join(output))
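
If buffering a whole 5 GB+ object in memory is not acceptable, the multipart upload mentioned above can do most of the work server-side with upload_part_copy. Below is a rough sketch of that approach, reusing the same (made-up) bucket and key names. Note that every part except the last must be at least 5 MiB, so the tiny compressed header is bundled with the first chunk of the original object; error handling (e.g. aborting the multipart upload on failure) is left out.

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Header as its own gzip member, newline included so it stays on its own row.
header_compressed = gzip.compress("Name,Date,Score\n".encode("utf-8"))

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

mpu = s3.create_multipart_upload(Bucket=bucket, Key=new_key)
upload_id = mpu["UploadId"]
parts = []

# Part 1: the compressed header plus the first chunk of the original object,
# since the header alone would be far below the 5 MiB minimum part size.
first_chunk_size = 8 * 1024 * 1024  # 8 MiB; an arbitrary choice
first_chunk = s3.get_object(
    Bucket=bucket, Key=key, Range=f"bytes=0-{first_chunk_size - 1}"
)["Body"].read()
resp = s3.upload_part(
    Bucket=bucket, Key=new_key, UploadId=upload_id, PartNumber=1,
    Body=header_compressed + first_chunk,
)
parts.append({"PartNumber": 1, "ETag": resp["ETag"]})

# Copy the rest of the original object server-side in large ranges;
# these bytes never pass through the machine running this script.
part_size = 512 * 1024 * 1024  # 512 MiB per copied part
part_number = 2
offset = first_chunk_size
while offset < size:
    end = min(offset + part_size, size) - 1
    resp = s3.upload_part_copy(
        Bucket=bucket, Key=new_key, UploadId=upload_id, PartNumber=part_number,
        CopySource={"Bucket": bucket, "Key": key},
        CopySourceRange=f"bytes={offset}-{end}",
    )
    parts.append({"PartNumber": part_number, "ETag": resp["CopyPartResult"]["ETag"]})
    part_number += 1
    offset = end + 1

s3.complete_multipart_upload(
    Bucket=bucket, Key=new_key, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)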
