How can I efficiently save a Python pandas DataFrame in HDF5 and open it as a data frame in R?

I think the title covers the issue, but to elucidate:

The pandas Python package has a DataFrame data type for holding tabular data in Python. It also has a convenient interface to the HDF5 file format, so pandas DataFrames (and other data) can be saved using a simple dict-like interface (assuming you have PyTables installed):

import pandas 
import numpy
d = pandas.HDFStore('data.h5')
d['testdata'] = pandas.DataFrame({'N': numpy.random.randn(5)})
d.close()

So far so good. However, if I then try to load that same HDF5 file into R, I see that things aren't so simple:

> library(hdf5)
> hdf5load('data.h5')
NULL
> testdata
$block0_values
         [,1]      [,2]      [,3]       [,4]      [,5]
[1,] 1.498147 0.8843877 -1.081656 0.08717049 -1.302641
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"

$block0_items
[1] "N"
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "string"
attr(,"name")
[1] "N."

$axis1
[1] 0 1 2 3 4
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "integer"
attr(,"name")
[1] "N."

$axis0
[1] "N"
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "string"
attr(,"name")
[1] "N."

attr(,"TITLE")
[1] ""
attr(,"CLASS")
[1] "GROUP"
attr(,"VERSION")
[1] "1.0"
attr(,"ndim")
[1] 2
attr(,"axis0_variety")
[1] "regular"
attr(,"axis1_variety")
[1] "regular"
attr(,"nblocks")
[1] 1
attr(,"block0_items_variety")
[1] "regular"
attr(,"pandas_type")
[1] "frame"

Which brings me to my question: ideally I would be able to save back and forth between R and pandas. I can obviously write a wrapper from pandas to R (I think, though it might become trickier if I use a pandas MultiIndex), but I don't think I could then easily use that data back in pandas. Any suggestions?
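To make the wrapper idea concrete, here is a minimal sketch, written in Python purely for illustration, of how the pieces in the dump (block0_items, block0_values, axis0, axis1) fit back together into a frame. The helper name frame_from_blocks is hypothetical, and the array orientation (one row of values per column name) is an assumption taken from how the R dump above prints block0_values. It does not handle a MultiIndex or duplicate column names.

```python
import numpy as np
import pandas as pd


def frame_from_blocks(blocks, axis0, axis1):
    """Rebuild a DataFrame from (items, values) block pairs.

    Hypothetical helper: `blocks` is a list of (items, values) pairs where
    `values` holds one row per name in `items`, `axis0` is the column order
    and `axis1` is the row index, mirroring the names in the R dump.
    """
    columns = {}
    for items, values in blocks:
        values = np.asarray(values)
        # Pair each column name with its row of values in this block.
        for name, row in zip(items, values):
            columns[name] = row
    # Emit the columns in the order recorded in axis0.
    ordered = {name: columns[name] for name in axis0}
    return pd.DataFrame(ordered, index=pd.Index(axis1))


# Values copied from the dump above.
df = frame_from_blocks(
    blocks=[(["N"], [[1.498147, 0.8843877, -1.081656, 0.08717049, -1.302641]])],
    axis0=["N"],
    axis1=[0, 1, 2, 3, 4],
)
print(df)
```

A real wrapper on the R side would follow the same steps: collect the blocks, attach the item names, and order the columns by axis0.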

Bonus: what I really want to do is use the data.table package in R with a pandas DataFrame (the keying approach is suspiciously similar in both packages). Any help on that one is greatly appreciated.

If you are still looking at this, take a look at this post on Google Groups. It shows how to exchange data between pandas and R via HDF5.

https://groups.google.com/forum/?fromgroups#!topic/pydata/0LR72GN9p6w

It would make sense to drop down to PyTables and store/get your data there.

Ultimately a DataFrame is a dict of Series, which is essentially what an HDF5 Table is. There are limitations on the translation due to incompatible dtypes, but for numerical data it should be straightforward.
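As a rough illustration of that correspondence (a sketch, not the code from this answer), a numeric DataFrame converts to a NumPy record array with one named, typed field per column, which is the kind of row layout a table with typed columns holds. The column names here are made up for the example, and actually writing the records out with PyTables is left out.

```python
import numpy as np
import pandas as pd

# Example frame with purely numeric columns (names are illustrative).
df = pd.DataFrame({
    "N": np.random.randn(5),
    "k": np.arange(5, dtype="int64"),
})

# One named, typed field per column; the index is dropped.
records = df.to_records(index=False)
print(records.dtype)

# Going back is just as direct for numeric dtypes.
restored = pd.DataFrame.from_records(records)
```

Columns holding Python objects, such as arbitrary strings, are where this mapping stops being clean, since they end up as object fields without a fixed-width type.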

The way pandas stores its HDF5 is better viewed as a binary blob. It has to support all the nuances of a DataFrame, which HDF5 does not support cleanly.

https://github.com/dalejung/trtools/blob/master/trtools/io/pytables.py

It has some of that kind of pandas/HDF5 munging code.

How to write a DataFrame to HDF5 so that it can be read in R is now covered in the pandas documentation: http://pandas-docs.github.io/pandas-docs-travis/io.html#external-compatibility
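The approach in that documentation section relies on pandas' "table" format rather than the default fixed format. A minimal sketch, assuming the optional PyTables package is installed; the helper name export_table_format and the file and key names in the usage comment are illustrative, not taken from the docs verbatim:

```python
import pandas as pd


def export_table_format(df, path, key):
    """Write `df` to `path` under `key` using the HDFStore table format.

    Requires the optional PyTables dependency; pandas raises ImportError
    when it is missing.
    """
    with pd.HDFStore(path, mode="w") as store:
        # data_columns=True stores every column as its own table column
        # instead of packing same-typed columns into shared value blocks.
        store.append(key, df, data_columns=True)


# Usage (illustrative names):
#   export_table_format(df, "export.h5", "df_for_r")
```

The resulting file still carries pandas-specific metadata, so some reshaping on the R side is to be expected.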

I recommend using feather, built by Wes McKinney and Hadley Wickham to solve the problem of transferring data between R and Python efficiently.

Python

import numpy as np
import pandas as pd
import feather as ft

df = pd.DataFrame({'N': np.random.randn(5)})
ft.write_dataframe(df, 'df.feather')

R

library(data.table)
library(feather)

dt <- data.table(read_feather("df.feather"))
dt
           N
1: 0.2777700
2: 1.4083377
3: 1.2940691
4: 0.8221348
5: 1.8552908
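On the Python side, pandas also exposes Feather directly through DataFrame.to_feather and pandas.read_feather, so the separate feather import may not be needed. A small sketch, assuming the pyarrow package is installed; the helper name feather_round_trip is hypothetical:

```python
import pandas as pd


def feather_round_trip(df, path):
    """Write `df` to a Feather file at `path` and read it straight back.

    Assumes pyarrow is available; pandas raises ImportError otherwise.
    """
    df.to_feather(path)
    return pd.read_feather(path)


# Usage (illustrative file name):
#   same_df = feather_round_trip(df, "df.feather")
```

The file written this way can be opened in R with read_feather exactly as in the R snippet above.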

You could use CSV files as the common data format. Both R and Python pandas can easily work with them. You might lose some precision, but whether that is a problem depends on your specific use case.
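A quick way to check the precision point on the pandas side is to round-trip a frame through CSV text in memory. This is only a sketch: it says nothing about how R parses the same file, and the "%.3f" format is just an example of deliberately truncated output.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({"N": np.random.randn(5)})

# Default float formatting, parsed back with the round-trip converter.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
exact = pd.read_csv(buf, float_precision="round_trip")

# Writing only three decimals discards information for good.
lossy_text = df.to_csv(index=False, float_format="%.3f")
lossy = pd.read_csv(io.StringIO(lossy_text))
max_error = (lossy["N"] - df["N"]).abs().max()

print("exact round trip:", np.array_equal(exact["N"].to_numpy(), df["N"].to_numpy()))
print("largest error with 3 decimals:", max_error)
```

If the values are written with an explicit float_format, the error is bounded by the rounding step; whether that matters is, as said above, problem specific.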
