
Convert pandas dataframe to parquet format and upload to s3 bucket

I have a list of parquet files that I need to copy from one S3 bucket to another S3 bucket in a different account. I have to add a few columns to the parquet files before I upload them. I am trying to read each file into a pandas dataframe, add the columns, and convert it back to parquet, but it does not seem to work.

Here is what I am trying. my_parquet_list is where I am getting the list of all keys.

import datetime
import io

import boto3
import pandas as pd

session = boto3.Session()
s3 = session.resource('s3')

for file in my_parquet_list:
    bucket = 'source_bucket_name'
    # download the source object into an in-memory buffer
    buffer = io.BytesIO()
    s3_obj = s3.Object(bucket, file)
    s3_obj.download_fileobj(buffer)
    buffer.seek(0)
    df = pd.read_parquet(buffer)
    df["col_new"] = 'xyz'
    df["date"] = datetime.datetime.utcnow()
    # serialize to a fresh buffer so the downloaded bytes are not appended to
    out_buffer = io.BytesIO()
    df.to_parquet(out_buffer, engine='pyarrow', index=False)
    bucketdest = 'dest_bucket_name'
    # keep the original file name under the destination prefix
    s3_file = 's3_folder_path/' + file.split('/')[-1]
    print(s3_file)
    s3.Object(bucketdest, s3_file).put(Body=out_buffer.getvalue())
    print('loaded')

Just pip install s3fs, then configure your AWS CLI, and finally you can just use df.to_parquet('s3://bucket_name/output-dir/df.parquet.gzip', index=False)
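For example, a minimal sketch of that approach for this question (the bucket names, destination prefix, and new column values below are placeholders, and my_parquet_list is assumed to hold the source object keys):

import datetime

import pandas as pd

# assumes `pip install s3fs pyarrow` and AWS credentials already configured via the CLI
for key in my_parquet_list:
    # pandas reads and writes s3:// URLs transparently through s3fs
    df = pd.read_parquet(f's3://source_bucket_name/{key}')
    df['col_new'] = 'xyz'
    df['date'] = datetime.datetime.utcnow()
    file_name = key.split('/')[-1]
    df.to_parquet(f's3://dest_bucket_name/s3_folder_path/{file_name}', index=False)

Note that the output codec is controlled by the compression argument of to_parquet (the default is snappy); naming the file .gzip does not by itself apply gzip compression.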
