
Best way to update first row of large CSV files in Amazon S3

I have 15 large files, each over 5 GB. The header from these 15 large CSV files is missing, and we need to inject it as the first row of each file. What is the most intelligent way to do this?

Currently, I have an S3 cp command that runs sed over the file, but it's slow and time-consuming. Is there a better approach? The data is gzipped.

I suppose if you do not save the file to disk, which aws s3 cp does, you could speed things up. (Though perhaps you are using a shell process substitution to avoid saving to disk.)

If you are open to using the AWS Python SDK, boto3, you could stream the response. But if you want to avoid loading the entire file into memory, you will need to use a multipart upload, which is kind of a pain to manage.
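For reference, a minimal sketch of that multipart route might look like the code below. This is not part of the original answer, just an illustration with placeholder bucket and key names: it streams the source object and re-uploads it in parts, buffering locally because every part except the last must be at least 5 MiB, which is part of what makes it a pain.

import boto3

# A minimal sketch (not from the original answer): stream an object and
# re-upload it with a multipart upload so it never sits in memory at once.
# Bucket and key names are placeholders.
s3 = boto3.client("s3")
bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"
PART_SIZE = 8 * 1024 * 1024  # every part except the last must be >= 5 MiB

body = s3.get_object(Bucket=bucket, Key=key)["Body"]
mpu = s3.create_multipart_upload(Bucket=bucket, Key=new_key)
parts, buffer, part_number = [], b"", 1
try:
    while True:
        chunk = body.read(PART_SIZE)
        buffer += chunk
        # Flush a part once enough bytes are buffered, or at end of stream.
        if len(buffer) >= PART_SIZE or (not chunk and buffer):
            resp = s3.upload_part(
                Bucket=bucket, Key=new_key, PartNumber=part_number,
                UploadId=mpu["UploadId"], Body=buffer,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
            part_number += 1
            buffer = b""
        if not chunk:
            break
    s3.complete_multipart_upload(
        Bucket=bucket, Key=new_key, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # Abort so incomplete parts are not left behind (they are billed as storage).
    s3.abort_multipart_upload(Bucket=bucket, Key=new_key, UploadId=mpu["UploadId"])
    raise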

This question indicates you could concatenate your header with the file without decompressing the large file, which could speed things up.
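As a quick local illustration of why that works (with made-up sample data): the gzip format allows independently compressed members to be concatenated, and decompression reads them back as one continuous stream.

import gzip

# Hypothetical sample data, compressed as two separate gzip members.
header = gzip.compress(b"Name,Date,Score\n")
rows = gzip.compress(b"Alice,2021-01-01,10\n")

# Concatenated members decompress to the header followed by the rows.
assert gzip.decompress(header + rows) == b"Name,Date,Score\nAlice,2021-01-01,10\n"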

Putting those two ideas together, here is an example.

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "mybucket"
key = "mykey.csv.gz"
new_key = "mykey2.csv.gz"

# Compress the header as its own gzip member; the trailing newline keeps it
# from running into the first data row when the file is decompressed.
my_header = "Name,Date,Score\n".encode("utf-8")
header_compressed = gzip.compress(my_header)

# Stream the existing object and prepend the compressed header; concatenated
# gzip members decompress as one continuous stream.
r = s3.get_object(Bucket=bucket, Key=key)
output = [header_compressed]
for chunk in r["Body"].iter_chunks():
    output.append(chunk)

s3.put_object(Bucket=bucket, Key=new_key, Body=b"".join(output))
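Note that this example still assembles the whole object in memory before calling put_object, and a single PUT is limited to 5 GB, so if the compressed files are larger than that you would have to combine the header trick with the streamed multipart upload sketched earlier.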
