
s3fs gzip compression on pandas dataframe

I'm trying to write a dataframe as a CSV file on S3 using the s3fs library and pandas. Despite what the documentation says, I'm afraid the gzip compression parameter isn't working with s3fs.

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()

def DfTos3Csv(df, file):
    with fs.open(file, 'wb') as f:
        df.to_csv(f, compression='gzip', index=False)

This code saves the dataframe as a new object in S3, but as a plain CSV, not in gzip format. On the other hand, the read functionality works fine with this compression parameter.

def s3CsvToDf(file):
    with fs.open(file) as f:
        df = pd.read_csv(f, compression='gzip')
    return df

Any suggestions/alternatives for the write issue? Thank you in advance!

The compression parameter of to_csv() does not work when writing to a stream (an already-open file object). You have to do the zipping and the uploading separately.

import gzip
import boto3
from io import BytesIO, TextIOWrapper

# In-memory buffer that will hold the gzipped CSV
buffer = BytesIO()

with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)

# Upload the compressed bytes to S3 with boto3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('bucket_name', 'key')
s3_object.put(Body=buffer.getvalue())
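
If you want to keep using s3fs as in the question, the same idea works: gzip the CSV in memory, then upload the compressed bytes through fs.open(). Below is a minimal sketch, assuming fs is an s3fs.S3FileSystem; the function name df_to_s3_csv_gz and the example path are placeholders, not from the original answer.

import gzip
import s3fs
from io import BytesIO

fs = s3fs.S3FileSystem()

def df_to_s3_csv_gz(df, path):
    # Gzip the CSV text into an in-memory buffer
    buffer = BytesIO()
    with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
        zipped_file.write(df.to_csv(index=False).encode('utf-8'))
    # Upload the compressed bytes through s3fs
    with fs.open(path, 'wb') as f:
        f.write(buffer.getvalue())

# df_to_s3_csv_gz(df, 's3://your_bucket_name/your_s3_key.csv.gz')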

pandas (v1.2.4) can write a CSV to S3 directly, with the compression functionality working properly. Legacy pandas may have problems with compression, e.g.

your_pandas_dataframe.to_csv('s3://your_bucket_name/your_s3_key.csv.gz', compression="gzip", index=False)
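
As a usage note beyond the answer above: with pandas ≥ 1.2, the gzip compression can also be inferred from the .csv.gz suffix, and s3fs options such as credentials can be passed via the storage_options parameter. The bucket/key names below are placeholders.

import pandas as pd

# compression='infer' (the default) picks gzip from the ".gz" suffix;
# storage_options is forwarded to s3fs (e.g. credentials, anonymous access)
your_pandas_dataframe.to_csv(
    's3://your_bucket_name/your_s3_key.csv.gz',
    index=False,
    storage_options={'anon': False},
)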
