How to read nrows from a Pandas HDF store?

Question

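What am I trying to do?

pd.read_csv(... nrows=###) can read the top nrows of a file. I would like to do the same when using pd.read_hdf(...).

What is the problem?

I am confused by the documentation. start and stop look like what I need, but when I try them, a ValueError is raised. The second thing I tried was nrows=10, thinking it might be an allowed **kwargs. When I do that, no error is thrown, but the full dataset is returned instead of just 10 rows.

Question: How does one correctly read a smaller subset of rows from an HDF file? (Edit: without having to read the whole thing into memory first!)

Below is my interactive session: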

>>> import pandas as pd
>>> df = pd.read_hdf('storage.h5')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    df = pd.read_hdf('storage.h5')
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 367, in read_hdf
    raise ValueError('key must be provided when HDF5 file '
ValueError: key must be provided when HDF5 file contains multiple datasets.
>>> import h5py
>>> f = h5py.File('storage.h5', mode='r')
>>> list(f.keys())[0]
'table'
>>> f.close()
>>> df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 370, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 740, in select
    return it.get_result()
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 1447, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 733, in func
    columns=columns, **kwargs)
  File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 2890, in read
    return self.obj_type(BlockManager(blocks, axes))
  File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 2795, in __init__
    self._verify_integrity()
  File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 3006, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
    passed, implied))
ValueError: Shape of passed values is (614, 593430), indices imply (614, 10)
>>> df = pd.read_hdf('storage.h5', key='table', nrows=10)
>>> df.shape
(593430, 614)

EDIT:

I just tried using where:

mylist = list(range(30))
df = pd.read_hdf('storage.h5', key='table', where='index=mylist')

and received a TypeError indicating a Fixed format store (the default format value of df.to_hdf(...)):

TypeError: cannot pass a where specification when reading from a
  Fixed format store. this store must be selected in its entirety

Does this mean I can't select a subset of rows if the store is in Fixed format?

Answer 1

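I ran into the same problem. I am now fairly sure that https://github.com/pandas-dev/pandas/issues/11188 tracks this very issue. It is a ticket from 2015 and it contains a repro. Jeff Reback suggested that this is actually a bug, and he even pointed towards a solution back in 2015. It is just that nobody has built that solution yet. I might give it a try.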
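Until that is fixed, one possible workaround (not something proposed in the linked issue, just a sketch) is to write the store in Table format, which does accept a where specification, unlike the Fixed format store in the question. The snippet below assumes PyTables is installed and that you control how the file is written; the demo data, the file path and the variable names are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo data; the store in the question is 593430 x 614.
df = pd.DataFrame(np.arange(200).reshape(50, 4), columns=list("abcd"))

# Hypothetical path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_table.h5")

# format="table" writes a queryable Table instead of the default Fixed format.
df.to_hdf(path, key="table", format="table")

# A where clause on the index is accepted for Table format stores,
# so only the matching rows are returned.
subset = pd.read_hdf(path, key="table", where="index < 10")
print(subset.shape)
```

Table format is generally slower to write and read than Fixed format, so this trades some speed for the ability to select rows on disk.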

Answer 2

This seems to work now, at least in pandas 1.0.1. Simply provide the start and stop arguments:

df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)
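For an nrows-style read like the one the question asks for, the start/stop call above can be wrapped in a small helper. This is only a sketch: read_hdf_head is a hypothetical name, not a pandas function, and the demo file is written in Table format (an assumption made here so the example is self-contained; PyTables must be installed).

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_hdf_head(path, key, nrows):
    """Hypothetical helper: return the first `nrows` rows stored under `key`."""
    # start/stop are row positions; stop is exclusive, like a Python slice.
    return pd.read_hdf(path, key=key, start=0, stop=nrows)


# Hypothetical demo data and path, standing in for the real store.
path = os.path.join(tempfile.mkdtemp(), "demo_head.h5")
demo = pd.DataFrame(np.random.rand(100, 3), columns=["x", "y", "z"])
demo.to_hdf(path, key="table", format="table")

head = read_hdf_head(path, "table", 10)
print(head.shape)
```

Note that passing nrows=10 directly to pd.read_hdf is not equivalent: as the question shows, it was silently ignored and the full dataset came back.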

Answer 1
0 2019-06-12 17:51:17

Answer 2
0 2020-03-26 16:18:29
