
How to reversibly store and load a Pandas dataframe to/from disk

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs, so I don't have to spend all that time waiting for the script to run?

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (pytables), which offers very fast access times for large datasets:

import pandas as pd
store = pd.HDFStore('store.h5')

store['df'] = df   # save it
df = store['df']   # load it back
store.close()      # close the underlying HDF5 file when done

More advanced strategies are discussed in the cookbook.
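
If you prefer not to manage the HDFStore object yourself, the same round trip can be written with the to_hdf/read_hdf convenience wrappers (PyTables must be installed; the file name and key below are just placeholders):

df.to_hdf('store.h5', key='df', mode='w')  # write the frame under the key 'df'
df = pd.read_hdf('store.h5', 'df')         # read it back (pd imported as above)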


Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question). (Note that msgpack support was later deprecated and removed from pandas, so this only applies to older versions.)

Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

  • pickle: original ASCII data format
  • cPickle, a C library
  • pickle-p2: uses the newer binary format
  • json: standardlib json library
  • json-no-index: like json, but without index
  • msgpack: binary JSON alternative
  • CSV
  • hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself.

The source code for the test which they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

[Benchmark chart: time comparison of the serialization methods]

They also mention that with the conversion of text data to categorical data the serialization is much faster. In their test about 10 times as fast (also see the test code).
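
As a rough illustration of that categorical trick (the column name here is only a placeholder, not from their test), the conversion is a one-liner before serializing:

df['text_column'] = df['text_column'].astype('category')  # hypothetical column name
df.to_pickle(file_name)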

Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.

Some other references:

If I understand correctly, you're already using pandas.read_csv() but would like to speed up the development process so that you don't have to load the file in every time you edit your script, is that right? I have a few recommendations:

  1. you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table while you're doing the development (see the sketch after this list)

  2. use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.

  3. convert the csv to an HDF5 table

  4. updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format that is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data).
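
A minimal sketch of the first recommendation, assuming a hypothetical data.csv; only the first 1,000 rows are parsed while you iterate on the script:

import pandas as pd

df = pd.read_csv('data.csv', nrows=1000)  # hypothetical file; load only the top of the table during development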

You might also be interested in this answer on stackoverflow.

Pickle works well!

import pandas as pd
df.to_pickle('123.pkl')    #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df

You can use the feather file format. It is extremely fast.

df.to_feather('filename.ft')
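
To make the round trip explicit (the .ft file name is just the example above; pyarrow must be installed for feather support), the frame can be loaded back with read_feather:

import pandas as pd

df = pd.read_feather('filename.ft')  # load the frame back from the feather file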

As already mentioned, there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

  1. pickle is a potential security risk. From the Python documentation for pickle:

Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

  2. pickle is slow. Find benchmarks here and here.

Depending on your setup/usage, these limitations may not apply, but I would not recommend pickle as the default persistence format for pandas data frames.
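
As a sketch of one of those alternatives, a parquet round trip looks like this (requires pyarrow or fastparquet; the file name is only a placeholder):

import pandas as pd

df.to_parquet('df.parquet')         # columnar, compressed, no arbitrary code execution on load
df = pd.read_parquet('df.parquet')  # read it back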

Pandas DataFrames have the to_pickle function, which is useful for saving a DataFrame:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

Numpy file formats are pretty fast for numerical data

I prefer to use numpy files since they're fast and easy to work with. Here's a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

using ipython's %%timeit magic function

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

the output is

100 loops, best of 3: 5.97 ms per loop

to load the data back into a dataframe

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

the output is

100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS

There's a problem if you save the numpy file using python 2 and then try opening it using python 3 (or vice versa).
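
Note also that np.save() converts the DataFrame to a plain ndarray, so the column names and index are not preserved by the snippet above. One way around that (a sketch, not part of the original benchmark) is to go through a structured record array:

import numpy as np
import pandas as pd

with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df.to_records(index=False))  # the structured array keeps the column names

with open('num.npy', 'rb') as np_file:
    data_df = pd.DataFrame(np.load(np_file))  # columns such as 'voltage' come back intact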

https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

Protocol version 0 is the original "human-readable" protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.

Another quite recent test with to_pickle():

I have 25 .csv files in total to process, and the final dataframe consists of roughly 2M items.

(Note: Besides loading the .csv files, I also manipulate some data and extend the data frame by new columns.)

Going through all 25 .csv files and creating the dataframe takes around 14 sec.

Loading the whole dataframe from a pkl file takes less than 1 sec.
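
A minimal sketch of that caching pattern, with hypothetical file names: build the dataframe from the CSV files once, pickle it, and reuse the pickle on later runs.

import os
import glob
import pandas as pd

CACHE = 'combined.pkl'  # hypothetical cache file

if os.path.exists(CACHE):
    df = pd.read_pickle(CACHE)  # fast path: well under a second in the test above
else:
    frames = [pd.read_csv(f) for f in glob.glob('data/*.csv')]  # hypothetical CSV location
    df = pd.concat(frames, ignore_index=True)
    df.to_pickle(CACHE)  # cache the result for the next run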

Arctic is a high performance datastore for Pandas, numpy and other numeric data. It sits on top of MongoDB. Perhaps overkill for the OP, but worth mentioning for other folks stumbling across this post.
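
A rough sketch of how that looks with the classic Arctic API (it assumes a MongoDB instance on localhost and an arbitrary library name, so treat the details as illustrative):

from arctic import Arctic

store = Arctic('localhost')             # connect to MongoDB
store.initialize_library('my_library')  # one-time setup; the library name is arbitrary
library = store['my_library']

library.write('my_df', df)              # versioned write of the dataframe
df_loaded = library.read('my_df').data  # read() returns an item whose .data attribute is the frame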

import pickle

example_dict = {1:"6",2:"2",3:"g"}

pickle_out = open("dict.pickle","wb")
pickle.dump(example_dict, pickle_out)
pickle_out.close()

The above code will save the pickle file.

pickle_in = open("dict.pickle","rb")
example_dict = pickle.load(pickle_in)

These two lines will load the saved pickle file back.

pyarrow compatibility across versions

The overall move has been to pyarrow/feather (deprecation warnings from pandas/msgpack). However I have a challenge with pyarrow being transient in specification: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I'm using serialization for redis, so I have to use a binary encoding.

I've retested various options (using a jupyter notebook):

import sys, pickle, zlib, warnings, io
import pandas as pd
import pyarrow as pa  # needed for pa.serialize below

# 'out' is the DataFrame under test, defined elsewhere in the notebook
class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally: 
        if sbreak: print("=+=" * 30)        
warnings.filterwarnings("default")

With the following results for my data frame (in the out jupyter variable):

pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=

Feather and parquet do not work for my data frame. I'm going to continue using pyarrow. However, I will supplement it with pickle (no compression): when writing to the cache I store both the pyarrow and pickle serialised forms, and when reading from the cache I fall back to pickle if the pyarrow deserialisation fails.
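
A rough sketch of that write-both/fall-back pattern (pa.serialize/pa.deserialize existed in the pyarrow versions discussed here but have since been deprecated, so treat this as illustrative; a plain dict stands in for redis):

import pickle
import pyarrow as pa

cache = {}  # stand-in for a redis connection

def cache_write(key, df):
    cache[key + ':arrow'] = pa.serialize(df).to_buffer().to_pybytes()  # fast path
    cache[key + ':pickle'] = pickle.dumps(df)                          # compatibility fallback

def cache_read(key):
    try:
        return pa.deserialize(cache[key + ':arrow'])
    except Exception:  # e.g. a pyarrow version mismatch
        return pickle.loads(cache[key + ':pickle'])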

There are a lot of great and sufficient answers here, but I would like to publish a test that I used on Kaggle, in which a large df is saved and read in different pandas-compatible formats:

https://www.kaggle.com/pedrocouto39/fast-reading-w-pickle-feather-parquet-jay

I'm not the author of this, nor a friend of the author; however, when I read this question I thought it was worth mentioning here.

  • CSV: 1min 42s
  • Pickle: 4.45 s
  • Feather: 4.35 s
  • Parquet: 8.31 s
  • Jay: 8.12 ms or 0.0812 s (blazing fast!)
