Pandas to_hdf succeeds but then read_hdf fails

Pandas to_hdf succeeds, but read_hdf then fails when I use custom objects as column headers (I use custom objects because I need to store other info in them).

Is there some way to make this work? Or is this just a Pandas bug or a PyTables bug?

As an example, below I first make a DataFrame foo that uses string column headers, and everything works fine with to_hdf / read_hdf. Then I change foo to use a custom Col class for the column headers: to_hdf still works fine, but read_hdf raises an assertion error:

In [48]: foo = pd.DataFrame(np.random.randn(2, 3), columns = ['aaa', 'bbb', 'ccc'])

In [49]: foo
Out[49]: 
        aaa       bbb       ccc
0 -0.434303  0.174689  1.373971
1 -0.562228  0.862092 -1.361979

In [50]: foo.to_hdf('foo.h5', 'foo')

In [51]: bar = pd.read_hdf('foo.h5', 'foo')

In [52]: bar
Out[52]: 
        aaa       bbb       ccc
0 -0.434303  0.174689  1.373971
1 -0.562228  0.862092 -1.361979

In [52]: 

In [53]: class Col(object):
...:     def __init__(self, name, other_info):
...:         self.name = name
...:         self.other_info = other_info
...:     def __str__(self):
...:         return self.name
...:     

In [54]: foo = pd.DataFrame(np.random.randn(2, 3), columns = [Col('aaa', {'z': 5}), Col('bbb', {'y': True}), Col('ccc', {})])

In [55]: foo
Out[55]: 
        aaa       bbb       ccc
0 -0.830503  1.066178  1.057349
1  0.406967 -0.131430  1.970204

In [56]: foo.to_hdf('foo.h5', 'foo')

In [57]: bar = pd.read_hdf('foo.h5', 'foo')
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-57-888b061a1d2c> in <module>()
----> 1 bar = pd.read_hdf('foo.h5', 'foo')

/.../python3.4/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
330 
331     try:
--> 332         return store.select(key, auto_close=auto_close, **kwargs)
333     except:
334         # if there is an error, close the store

/.../python3.4/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
672                            auto_close=auto_close)
673 
--> 674         return it.get_result()
675 
676     def select_as_coordinates(

/.../python3.4/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
   1366 
   1367         # directly return the result
-> 1368         results = self.func(self.start, self.stop, where)
   1369         self.close()
   1370         return results

/.../python3.4/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
665             return s.read(start=_start, stop=_stop,
666                           where=_where,
--> 667                           columns=columns, **kwargs)
668 
669         # create the iterator

/.../python3.4/site-packages/pandas/io/pytables.py in read(self, **kwargs)
   2792             blocks.append(blk)
   2793 
-> 2794         return self.obj_type(BlockManager(blocks, axes))
   2795 
   2796     def write(self, obj, **kwargs):

/.../python3.4/site-packages/pandas/core/internals.py in __init__(self, blocks, axes, do_integrity_check, fastpath)
   2180         self._consolidate_check()
   2181 
-> 2182         self._rebuild_blknos_and_blklocs()
   2183 
   2184     def make_empty(self, axes=None):

/.../python3.4/site-packages/pandas/core/internals.py in _rebuild_blknos_and_blklocs(self)
   2271 
   2272         if (new_blknos == -1).any():
-> 2273             raise AssertionError("Gaps in blk ref_locs")
   2274 
   2275         self._blknos = new_blknos

AssertionError: Gaps in blk ref_locs

UPDATE:

So Jeff answered (a) "this is not supported" and (b) "if you have meta-data then write it to the attributes".

Question 1 regarding (a): My column header objects have methods to return their properties, etc. For example, instead of a column header string 'x5y3z8' where I would have to parse out the values, I can simply do col_header.x (gives 5), col_header.y (gives 3), etc. This is very object-oriented and pythonic, compared with storing info in a string and having to parse it every time I need it. How do you suggest I replace my current column header objects in a nice way that is also supported?

(BTW, you might look at 'x5y3z8' and think a hierarchical index would work, but that is not the case because not every column header is 'x#y#z#'. I might have one column 'foo' of strings, another one 'bar5baz7' of ints, and another 'x5y3z8' of floats. The column headers aren't uniform.)

Question 2 regarding (a): When you say it's not supported, are you specifically talking about to_hdf/read_hdf not supporting it, or are you actually saying that Pandas in general doesn't support it? If it's only the HDF5 support that's missing, then I could switch to some other way of saving the DataFrames to disk and have it work, right? Do you foresee any problems with that in the future? Will this ever break with to_pickle/read_pickle, for example? (I lose performance, but I have to give up something, right?)

Question 3 regarding (b): What do you mean by "if you have meta-data then write it to the attributes"? Attributes of what? A simple example would help me a lot. I'm pretty new to Pandas. Thanks!

This is not a supported feature.

In the next version of pandas, writing such a frame will raise for format='table'. It should raise for the fixed format as well, but that's not implemented. This is simply not supported, nor is it likely to be. You should just use strings. If you have meta-data, then write it to the attributes.
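A minimal sketch of what "use strings and write the meta-data to the attributes" could look like, assuming PyTables is installed. The attribute name col_info and the side dictionary are hypothetical choices for illustration, not something the answer prescribes. The Col class is the one from the question. The column headers become plain strings, and the extra info is attached to the stored node through HDFStore.get_storer(...).attrs:

```python
import os
import tempfile

import numpy as np
import pandas as pd


class Col(object):
    """Column descriptor from the question: a name plus arbitrary extra info."""

    def __init__(self, name, other_info):
        self.name = name
        self.other_info = other_info

    def __str__(self):
        return self.name


cols = [Col('aaa', {'z': 5}), Col('bbb', {'y': True}), Col('ccc', {})]

# Use plain strings as the column headers, as the answer recommends.
foo = pd.DataFrame(np.random.randn(2, 3), columns=[c.name for c in cols])

# Hypothetical side table: column name -> extra info for that column.
col_info = {c.name: c.other_info for c in cols}

# Write to a temporary location so the sketch does not clobber real files.
path = os.path.join(tempfile.mkdtemp(), 'foo.h5')

with pd.HDFStore(path) as store:
    store.put('foo', foo)
    # Attach the meta-data to the attributes of the stored node.
    # 'col_info' is an assumed attribute name; any valid identifier works.
    store.get_storer('foo').attrs.col_info = col_info

with pd.HDFStore(path) as store:
    bar = store['foo']
    restored_info = store.get_storer('foo').attrs.col_info

# Rebuild the descriptor objects, keyed by column name.
restored_cols = {name: Col(name, info) for name, info in restored_info.items()}
```

With this layout, bar[name] selects the data by string and restored_cols[name].other_info gives the object-style access to the extra info, so nothing has to be parsed out of the header string. The meta-data here is a small dictionary of plain Python values; larger or more exotic objects may be better stored elsewhere.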
