Writing a pickle file to an s3 bucket in AWS

I'm trying to write a pandas dataframe as a pickle file into an s3 bucket in AWS. I know that I can write dataframe new_df as a csv to an s3 bucket as follows:

import boto3
from io import StringIO

bucket='mybucket'
key='path'

csv_buffer = StringIO()
s3_resource = boto3.resource('s3')

new_df.to_csv(csv_buffer, index=False)
s3_resource.Object(bucket, key).put(Body=csv_buffer.getvalue())

I've tried using the same code as above with to_pickle() but with no success.

Further to your answer, you don't need to convert to csv. The pickle.dumps method returns a bytes object; see here: https://docs.python.org/3/library/pickle.html

import boto3
import pickle

bucket='your_bucket_name'
key='your_pickle_filename.pkl'
pickle_byte_obj = pickle.dumps([var1, var2, ..., varn]) 
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket,key).put(Body=pickle_byte_obj)
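
To read the object back, here is a minimal sketch (assuming the same bucket and key as above) that fetches the raw bytes with get() and unpickles them:

import boto3
import pickle

s3_resource = boto3.resource('s3')

# download the raw bytes from S3 and deserialize them back into Python objects
body = s3_resource.Object(bucket, key).get()['Body'].read()
restored = pickle.loads(body)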

I've found the solution: you need to use BytesIO for the buffer for pickle files instead of StringIO (which is for CSV files).

import io
import boto3

pickle_buffer = io.BytesIO()
s3_resource = boto3.resource('s3')

# serialize the dataframe into the in-memory buffer, then upload its bytes
new_df.to_pickle(pickle_buffer)
s3_resource.Object(bucket, key).put(Body=pickle_buffer.getvalue())

This worked for me with pandas 0.23.4 and boto3 1.7.80:

bucket='your_bucket_name'
key='your_pickle_filename.pkl'
s3_resource = boto3.resource('s3')
# write the pickle to a local file, then upload it to S3
new_df.to_pickle(key)
with open(key, 'rb') as f:
    s3_resource.Object(bucket, key).put(Body=f)
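
If you'd rather not manage the file handle yourself, a sketch using boto3's managed transfer (upload_file), which opens, streams and closes the local file for you, could look like this:

import boto3

s3_resource = boto3.resource('s3')

# managed upload of the local pickle file to the same bucket/key
s3_resource.Object(bucket, key).upload_file(key)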

This solution (using s3fs) worked perfectly and elegantly for my team:

import s3fs
from pickle import dump

fs = s3fs.S3FileSystem(anon=False)

bucket = 'bucket1'
key = 'your_pickle_filename.pkl'

with fs.open(f's3://{bucket}/{key}', 'wb') as f:
    dump(data, f)

This adds some clarification to a previous answer:

import pandas as pd
import boto3

# make df
df = pd.DataFrame({'col1': [1, 2, 3]})

# bucket name
str_bucket = 'bucket_name'
# filename
str_key_file = 'df.pkl'
# bucket path
str_key_bucket = f'dir_1/dir2/{str_key_file}'

# write df to local pkl file
df.to_pickle(str_key_file)

# put object into s3
with open(str_key_file, 'rb') as f:
    boto3.resource('s3').Object(str_bucket, str_key_bucket).put(Body=f)
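
Since this writes an intermediate pickle to the local disk, it may be worth deleting it once the upload has finished; a small sketch:

import os

# remove the temporary local pickle after it has been uploaded
os.remove(str_key_file)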

From the just-released book 'Time Series Analysis with Python' by Tarek Atwan, I learned this method:

import pandas as pd

df = pd.DataFrame(...)

df.to_pickle('s3://mybucket/pklfile.bz2',
             storage_options={
                 'key': AWS_ACCESS_KEY,
                 'secret': AWS_SECRET_KEY
             })

which I believe is more pythonic. I should add that this works for me most of the time, but sometimes it throws a "PermissionError: Access Denied" exception that I cannot explain (as I am new to AWS/S3, I must be missing something in the setup).
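
For completeness, the same storage_options dictionary can be passed when reading the file back. A sketch using the same placeholder credentials (this needs a pandas version with storage_options support, roughly 1.2+, plus s3fs installed):

import pandas as pd

# read the pickle straight from S3, passing credentials the same way as on the write side
df = pd.read_pickle('s3://mybucket/pklfile.bz2',
                    storage_options={
                        'key': AWS_ACCESS_KEY,
                        'secret': AWS_SECRET_KEY
                    })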

I've found the best solution - just upgrade pandas and also install s3fs:

pip install s3fs==2022.8.2
pip install pandas==1.1.5


bucket, key = 'mybucket', 'path'


df.to_pickle(f"{bucket}{key}.pkl.gz", compression='gzip')
