
Write Pandas DataFrame To S3 as Pickle

Here are my requirements.

  • Upload a pandas dataframe to AWS S3 as a pickle file
  • Because of environment constraints, boto3 must be used; alternatives such as s3fs are not an option
  • The data must stay in memory, and writing to temporary files is not possible

I created the following simple function that uploads a Pandas dataframe to S3 as a CSV:

import io
import boto3

def df_to_s3_csv(df, filename, sep=','):
    # s3bucket and s3_upload_path are assumed to be defined elsewhere
    s3 = boto3.resource('s3')
    buffer = io.StringIO()
    df.to_csv(buffer, sep=sep, index=False)
    s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=buffer.getvalue())

This function works fine and does what it is supposed to. For the pickle file, I created the following function in a similar manner:

def df_to_s3_pckl(df, filename):
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    df.to_pickle(buffer)  # with a BytesIO buffer, this call ends up closing it
    buffer.seek(0)
    obj = s3.Object(s3bucket, f'{s3_upload_path}/{filename}')
    obj.put(Body=buffer.getvalue())

I tried this function with and without the seek portion, and either way it throws the following error: ValueError: I/O operation on closed file.

Looking further into the issue, I found that buffer is considered closed as soon as df.to_pickle is called. This is reproducible by issuing these commands:

buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)

The above prints True. It appears that the BytesIO buffer is closed by to_pickle, and therefore its data cannot be referenced. How can this issue be resolved, or is there an alternative that meets my requirements? I've found several questions on SO about how to upload to S3 using boto3, but nothing regarding how to upload pickle files created by Pandas using BytesIO buffers.

Here is a minimal reproducible example of the underlying issue:

import pandas as pd
import numpy as np
import io
df = pd.DataFrame(np.random.randint(0,100,size=(4,4)))
buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)

It appears that the issue can be traced to the pandas source code. This may ultimately be a bug in pandas revealed by unanticipated usage of a BytesIO object in the to_pickle method. I managed to circumvent the issue in the minimal reproducible example with the following code, which uses the dump method from the pickle module:

import pandas as pd
import numpy as np
import io
from pickle import dump
df = pd.DataFrame(np.random.randint(0,100,size=(4,4)))
buffer = io.BytesIO()
# unlike df.to_pickle, pickle.dump leaves the caller-supplied buffer open
dump(df, buffer)
buffer.seek(0)
print(buffer.closed)

Now the print statement prints False and the BytesIO stream data can be accessed.
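
Putting this workaround together with the original helper, a minimal sketch of the upload function could look like the following. It assumes, as in the question, that s3bucket and s3_upload_path are defined elsewhere, and it simply swaps df.to_pickle for pickle.dump:

import io
import boto3
from pickle import dump

def df_to_s3_pckl(df, filename):
    # Sketch only: s3bucket and s3_upload_path are assumed to be defined elsewhere
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    dump(df, buffer)  # serializes the dataframe without closing the buffer
    obj = s3.Object(s3bucket, f'{s3_upload_path}/{filename}')
    obj.put(Body=buffer.getvalue())

Alternatively, pickle.dumps(df) returns the serialized bytes directly, so it could be passed straight to put(Body=...) without any intermediate buffer.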
