
Best way to cache a pandas dataframe?

Yesterday I learned the hard way that saving a pandas dataframe to CSV for later use is a bad idea. I have a dataframe of roughly 130k tweets, where one row of the dataframe holds a list of tweets. When I saved the data to CSV and then loaded the dataframe back in, the rows of my dataframe were of type string. This led to all kinds of errors and a lot of debugging. Of course, it was a mistake to assume that CSV would preserve information about the data structure type of my data.
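The round trip described above can be reproduced with a small sketch. The column names and values here are hypothetical stand-ins for the real tweet data, and an in-memory buffer is used in place of a file on disk:

```python
import io

import pandas as pd

# Hypothetical stand-in data: the "tweets" column holds Python lists.
df = pd.DataFrame({"user": ["a", "b"], "tweets": [["hi", "hello"], ["bye"]]})

# Write to an in-memory CSV buffer and read it straight back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Before the round trip the cell is a list; afterwards it is plain text.
original_type = type(df.loc[0, "tweets"])
restored_type = type(restored.loc[0, "tweets"])
print(original_type, restored_type)
```

CSV stores only text, so each list is written as its string representation and comes back as a string rather than a list.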

My question now is: how do I save a dataframe for later use in a way that preserves the data types of my columns/rows?

I hope you found the solution you were looking for.
To answer the question: you can use the DataFrame.to_pickle() method to serialize the dataframe (convert Python objects into a byte stream). When you de-serialize the pickle file, you get the data back as it was. Keep in mind that pickle files can pose a security threat when they come from untrusted sources, because unpickling can execute arbitrary code.

Here's an example from the docs on how to use pickle:

>>> import pandas as pd
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9

>>> pd.to_pickle(original_df, "./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl")
>>> unpickled_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
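The doc example only uses integer columns. A minimal sketch closer to the question, assuming a hypothetical "tweets" column that holds lists and a temporary file path chosen just for illustration, suggests that list-valued cells come back as lists:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data: the "tweets" column holds Python lists.
df = pd.DataFrame({"user": ["a", "b"], "tweets": [["hi", "hello"], ["bye"]]})

# Pickle to a temporary file (illustrative path) and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tweets.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The cells are still lists, and the column dtypes match the original.
restored_type = type(restored.loc[0, "tweets"])
same_values = restored["tweets"].tolist() == df["tweets"].tolist()
same_dtypes = restored.dtypes.equals(df.dtypes)
print(restored_type, same_values, same_dtypes)
```

Only load pickle files that you created yourself or that come from a source you trust.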

