
Write pandas dataframe to parquet in s3 AWS

I want to write my dataframe to my s3 bucket in parquet format. I know how to write the dataframe in csv format, but I don't know how to write it in parquet format. Here is the code for the csv format (I don't display the fields ServerSideEncryption and SSEKMSKeyId, but I use them in my actual code):

csv_to_write = df.to_csv(None).encode()
s3_client.put_object(Bucket=bucket_name, Key='data.csv', Body=csv_to_write,
                     ServerSideEncryption='XXXXX', SSEKMSKeyId='XXXXXXXX')

Does someone have the equivalent for parquet? Thanks

For python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas/S3/Parquet.

To install, do:

pip install awswrangler

If you want to write your pandas dataframe as a parquet file to S3, do:

import awswrangler as wr
wr.s3.to_parquet(
    dataframe=df,
    path="s3://my-bucket/key/my-file.parquet"
)
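
To check the write, you can read the file back into a dataframe (a quick sketch; wr.s3.read_parquet is available in recent awswrangler releases, and the path is the same as above):

import awswrangler as wr

# Read the parquet file back from S3 into a pandas dataframe
df2 = wr.s3.read_parquet(path="s3://my-bucket/key/my-file.parquet")
print(df2.head())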

If you want to add encryption, do:

import awswrangler as wr
extra_args = {
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "YOUR_KMS_KEY_ARN"
}
sess = wr.Session(s3_additional_kwargs=extra_args)
sess.s3.to_parquet(
    dataframe=df,
    path="s3://my-bucket/key/my-file.parquet"
)
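
Note that newer awswrangler releases (1.0+) dropped the Session object, so if the snippet above errors for you, the encryption settings can instead be passed straight to the write call. A sketch, assuming awswrangler 1.0+ where s3_additional_kwargs is a parameter of wr.s3.to_parquet itself:

import awswrangler as wr

# Assumes awswrangler 1.0+: the dataframe is passed as df,
# and extra S3 kwargs go directly on the write call
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet",
    s3_additional_kwargs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": "YOUR_KMS_KEY_ARN",
    },
)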

Assuming your dataframe is called df, use the following code to first convert it to parquet format and store it locally, then upload the parquet file to s3.

import pyarrow as pa
import pyarrow.parquet as pq
import boto3

# Convert the dataframe to an Arrow table and write it to a local parquet file
parquet_table = pa.Table.from_pandas(df)
pq.write_table(parquet_table, local_file_name)

# Upload the local file to s3 (credentials shown inline for illustration only)
s3 = boto3.client('s3', aws_access_key_id='XXX', aws_secret_access_key='XXX')
s3.upload_file(local_file_name, bucket_name, remote_file_name)
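
If you'd rather avoid the intermediate local file, the same pyarrow table can be written to an in-memory buffer and uploaded with put_object, mirroring the csv pattern from the question. A sketch, where bucket_name, remote_file_name and the encryption values are placeholders:

import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# Serialize the dataframe to parquet bytes in memory
buffer = io.BytesIO()
pq.write_table(pa.Table.from_pandas(df), buffer)

# Upload the bytes directly, with the same encryption kwargs as for csv
s3 = boto3.client('s3')
s3.put_object(Bucket=bucket_name, Key=remote_file_name, Body=buffer.getvalue(),
              ServerSideEncryption='XXXXX', SSEKMSKeyId='XXXXXXXX')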

Excellent solution above with the use of AWS Wrangler, but I did get an error when I attempted to use the example above, assuming the lib has changed. The following worked for me (in awswrangler 1.0+ the dataframe is the first positional argument, and index=False keeps the dataframe index out of the parquet file):

wr.s3.to_parquet(df, path=f"s3://{output_bucket}/{output_key}.parquet", index=False)
