How to save a list in a pandas dataframe cell to an HDF5 table format?
I have a dataframe that I want to save in the appendable format to an HDF5 file. The dataframe looks like this:
column1
0 [0, 1, 2, 3, 4]
And the code that replicates the issue is:
import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5))]})
test.to_hdf('test','testgroup',format="table")
Unfortunately, it returns this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-65-c2dbeaca15df> in <module>
1 test = pd.DataFrame({"column1":[list(range(0,5))]})
----> 2 test.to_hdf('test','testgroup',format="table")
7 frames
/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors, columns)
4979 error_column_label = columns[i] if len(columns) > i else f"No.{i}"
4980 raise TypeError(
-> 4981 f"Cannot serialize the column [{error_column_label}]\n"
4982 f"because its data contents are not [string] but "
4983 f"[{inferred_type}] object dtype"
TypeError: Cannot serialize the column [column1]
because its data contents are not [string] but [mixed] object dtype
I am aware that I can save each value in a separate column. This does not help my extended use case, as there might be variable length lists.
I know I could convert the list to a string and then recreate it based on the string, but if I start converting each column to string, I might as well use a text format, like CSV, instead of a binary one like HDF5.
Is there a standard way of saving lists into HDF5 table format?
Python lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand your Pandas 'column1' dataframe, it may contain lists of different lengths. There is no (simple) way to represent this as an HDF5 dataset. [That's why you got the [mixed] object dtype message. The conversion of the dataframe creates an intermediate object that is written as a dataset. The dtype of the converted list data is "O" (object), and HDF5 doesn't support this type.]
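To see where the "mixed" label in the error comes from, here is a minimal sketch that inspects the column the same way the traceback suggests pandas does. It assumes only the dataframe from the question; `infer_dtype` is the public pandas helper, and the variable names are just for illustration.

```python
import pandas as pd
from pandas.api.types import infer_dtype

test = pd.DataFrame({"column1": [list(range(0, 5))]})

# A column of Python lists is stored with the generic object dtype
column_dtype = test["column1"].dtype

# pandas classifies the values it finds; lists are neither strings
# nor plain numbers, so the column is reported as "mixed"
inferred = infer_dtype(test["column1"], skipna=False)

print(column_dtype, inferred)
```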
However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all df list entities are the same type (int in this case), and 2) all df lists are the same length. (We can handle different length lists, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the underlying package for Pandas HDF5 support and h5py is widely used. The choice is yours.
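For the different-length case mentioned in passing, one possible route (not the approach used in the rest of this answer) is h5py's variable-length dtype. The sketch below is only an illustration: the `ragged` dataframe, the dataset name and the file name are made up, and the result is a plain h5py dataset, not a pandas "table" format store.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Hypothetical data: lists of ints with different lengths
ragged = pd.DataFrame({"column1": [[0, 1, 2], [10, 11], [100, 101, 102, 103]]})

# Variable-length dtype whose elements are 32-bit ints
vlen_int = h5py.vlen_dtype(np.dtype("int32"))

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "ragged_example.h5")

with h5py.File(path, "w") as h5f:
    # one variable-length entry per dataframe row
    dset = h5f.create_dataset("column1", shape=(len(ragged),), dtype=vlen_int)
    for i, values in enumerate(ragged["column1"]):
        dset[i] = np.asarray(values, dtype="int32")

# read the rows back and turn each one into a Python list again
with h5py.File(path, "r") as h5f:
    restored = [row.tolist() for row in h5f["column1"][:]]

print(restored)
```

The type assumption still applies here: every list must hold values that fit the chosen base dtype.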
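Before I post the code, here is an outline of the process:
1. Create a NumPy record array (recarray) from the dataframe.
2. Define the type and shape of the HDF5 dataset.
3. Create the dataset.
4. Loop over the rows of the recarray (the lists) and write them to the dataset.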
Code to create recarray (adds 2 rows to your dataframe):
import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5)), list(range(10,15)), list(range(100,105))]})
# create recarray from the dataframe (index=False leaves the dataframe index out of the records)
rec_arr = test.to_records(index=False)
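Before exporting, it may be worth confirming that the two assumptions actually hold for the recarray. The check below is a rough sketch; `lengths`, `types`, `is_uniform` and `data_2d` are names introduced here for illustration and are not used by the export code that follows.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({"column1": [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]})
rec_arr = test.to_records(index=False)

# collect the distinct list lengths and element types found in the column
lengths = {len(row) for row in rec_arr["column1"]}
types = {type(item) for row in rec_arr["column1"] for item in row}

# both assumptions hold when there is exactly one length and one type
is_uniform = len(lengths) == 1 and len(types) == 1

if is_uniform:
    # uniform lists can also be stacked into a regular 2D array
    data_2d = np.array(rec_arr["column1"].tolist(), dtype="int32")
    print(data_2d.shape)
```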
PyTables specific code to export data:
import tables as tb
with tb.File('74489101_tb.h5', 'w') as h5f:
    # define "atom" with type and shape of column1 data
    df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),) )
    # create the dataset
    test = h5f.create_array('/','test', shape=rec_arr.shape, atom=df_atom )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
h5py specific code to export data:
import h5py
with h5py.File('74489101_h5py.h5', 'w') as h5f:
    # define dtype with type and shape of column1 data
    df_dt = (int,(len(rec_arr[0]['column1']),))
    # create the dataset
    test = h5f.create_dataset('test', shape=rec_arr.shape, dtype=df_dt )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
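To get the lists back into a dataframe, something along these lines should work. This is a sketch under the same assumptions (same type, same length); it writes the data as a plain 2D dataset instead of the row-by-row loop above, and the file name is a made-up placeholder in a temporary directory.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

test = pd.DataFrame({"column1": [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]})

# stack the equal-length lists into one 2D integer array
data = np.array(test["column1"].tolist(), dtype="int32")

# hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "roundtrip_example.h5")

with h5py.File(path, "w") as h5f:
    h5f.create_dataset("test", data=data)

# read the whole dataset into memory as a NumPy array
with h5py.File(path, "r") as h5f:
    loaded = h5f["test"][:]

# each row of the array becomes one list in the rebuilt column
restored = pd.DataFrame({"column1": loaded.tolist()})
print(restored)
```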