Convert pandas dataframe to parquet format and upload to s3 bucket
I have a list of parquet files that I need to copy from one S3 bucket to another S3 bucket in a different account. I have to add a few columns to the parquet files before I upload them.
I am trying to read each file into a pandas DataFrame, add the columns, and convert it back to parquet, but it does not seem to work.
Here is what I am trying; my_parquet_list is where I get the list of all keys.
import datetime
import io

import boto3
import pandas as pd

session = boto3.session.Session()
s3 = session.resource('s3')

for file in my_parquet_list:
    bucket = 'source_bucket_name'
    buffer = io.BytesIO()
    s3_obj = s3.Object(bucket, file)
    s3_obj.download_fileobj(buffer)
    buffer.seek(0)  # rewind after the download, otherwise read_parquet sees an empty stream
    df = pd.read_parquet(buffer)
    df["col_new"] = 'xyz'
    df["date"] = datetime.datetime.utcnow()
    out_buffer = io.BytesIO()  # write to a fresh buffer instead of reusing the download buffer
    df.to_parquet(out_buffer, engine='pyarrow', index=False)
    bucketdest = 'dest_bucket_name'
    s3_file = 's3_folder_path/' + file.split('/')[-1]  # keep the source file name in the destination key
    print(s3_file)
    s3.Object(bucketdest, s3_file).put(Body=out_buffer.getvalue())
    print('loaded')
Just pip install s3fs, then configure your AWS CLI, and finally you can write straight to S3 with df.to_parquet('s3://bucket_name/output-dir/df.parquet.gzip', index=False).
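Applied to the loop from the question, the whole job then collapses to a few lines. A minimal sketch, assuming s3fs is installed, the configured credentials can read the source bucket and write to the destination bucket, and reusing the placeholder names (my_parquet_list, source_bucket_name, dest_bucket_name, s3_folder_path) from the question:

import datetime
import pandas as pd

# with s3fs installed, pandas handles s3:// URLs directly, so no boto3 plumbing is needed
for file in my_parquet_list:
    df = pd.read_parquet(f's3://source_bucket_name/{file}')
    df['col_new'] = 'xyz'
    df['date'] = datetime.datetime.utcnow()
    # keep the source file name under the destination folder
    df.to_parquet(f's3://dest_bucket_name/s3_folder_path/{file.split("/")[-1]}', index=False)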