Write Pandas DataFrame To S3 as Pickle
Here are my requirements. I created the following simple function that uploads a Pandas dataframe to S3 as a CSV:
import io
import boto3

def df_to_s3_csv(df, filename, sep=','):
    s3 = boto3.resource('s3')
    buffer = io.StringIO()
    df.to_csv(buffer, sep=sep, index=False)
    # s3bucket and s3_upload_path are module-level settings
    s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=buffer.getvalue())
This function works fine and does what it is supposed to. For the pickle file, I created the following function in a similar manner:
def df_to_s3_pckl(df, filename):
    s3 = boto3.resource('s3')
    buffer = io.BytesIO()
    df.to_pickle(buffer)
    buffer.seek(0)
    obj = s3.Object(s3bucket, f'{s3_upload_path}/{filename}')
    obj.put(Body=buffer.getvalue())
I tried this function with and without the seek portion, and either way it throws the following error: ValueError: I/O operation on closed file. Looking further into the issue, I found that buffer is considered closed as soon as df.to_pickle is called. This is reproducible by issuing these commands:
buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)
The above prints True. It appears that the BytesIO buffer is closed by to_pickle and therefore its data cannot be referenced. How can this issue be resolved, or is there an alternative that meets my requirements? I've found several questions on SO about how to upload to S3 using boto3, but nothing regarding how to upload pickle files created by Pandas using BytesIO buffers.
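For reference, one possible alternative that sidesteps the buffer entirely is pickle.dumps, which returns the serialized bytes directly. This is only a sketch, not code from the question; the commented upload lines assume the same s3bucket, s3_upload_path, and filename names used above:

```python
import pickle

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# pickle.dumps returns bytes directly, so no BytesIO buffer is involved
# and there is nothing for to_pickle to close
body = pickle.dumps(df)

# round-trip check: the bytes deserialize back to an equal DataFrame
assert pickle.loads(body).equals(df)

# the bytes can then be uploaded as-is (requires boto3 and credentials):
# s3 = boto3.resource('s3')
# s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=body)
```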
Here is a minimal reproducible example of the underlying issue:
import pandas as pd
import numpy as np
import io
df = pd.DataFrame(np.random.randint(0,100,size=(4,4)))
buffer = io.BytesIO()
df.to_pickle(buffer)
print(buffer.closed)
It appears that the issue can be traced to the pandas source code. This may ultimately be a bug in pandas revealed by unanticipated usage of a BytesIO object in the to_pickle method. I managed to circumvent the issue in the minimal reproducible example with the following code, which uses the dump method from the pickle module:
import pandas as pd
import numpy as np
import io
from pickle import dump
df = pd.DataFrame(np.random.randint(0,100,size=(4,4)))
buffer = io.BytesIO()
dump(df, buffer)
buffer.seek(0)
print(buffer.closed)
Now the print statement prints False and the BytesIO stream data can be accessed.
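Putting the workaround back into the original upload helper, a complete version might look like the following sketch. The serialization step is split out so it can be checked without S3 access; s3bucket and s3_upload_path are assumed to be defined as in the question:

```python
import io
import pickle

import pandas as pd

def df_to_pickle_bytes(df):
    # pickle.dump writes into the BytesIO buffer without closing it,
    # unlike df.to_pickle, so the data can still be read back out
    buffer = io.BytesIO()
    pickle.dump(df, buffer)
    buffer.seek(0)
    return buffer.getvalue()

def df_to_s3_pckl(df, filename):
    import boto3  # imported here so the helper above runs without boto3
    s3 = boto3.resource('s3')
    s3.Object(s3bucket, f'{s3_upload_path}/{filename}').put(Body=df_to_pickle_bytes(df))
```

A DataFrame serialized this way round-trips cleanly through pickle.loads, which is an easy sanity check before wiring in the upload.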