
How to save a list in a pandas dataframe cell to a HDF5 table format?

I have a dataframe that I want to save in the appendable format to an HDF5 file. The dataframe looks like this:

    column1
0   [0, 1, 2, 3, 4]

And the code that replicates the issue is:

import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5))]})
test.to_hdf('test','testgroup',format="table")

Unfortunately, it returns this error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-65-c2dbeaca15df> in <module>
      1 test = pd.DataFrame({"column1":[list(range(0,5))]})
----> 2 test.to_hdf('test','testgroup',format="table")

7 frames

/usr/local/lib/python3.7/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors, columns)
   4979                 error_column_label = columns[i] if len(columns) > i else f"No.{i}"
   4980                 raise TypeError(
-> 4981                     f"Cannot serialize the column [{error_column_label}]\n"
   4982                     f"because its data contents are not [string] but "
   4983                     f"[{inferred_type}] object dtype"

TypeError: Cannot serialize the column [column1]
because its data contents are not [string] but [mixed] object dtype

I am aware that I can save each value in a separate column. This does not help my extended use case, as there might be variable length lists.

I know I could convert the list to a string and then recreate it based on the string, but if I start converting each column to string, I might as well use a text format, like csv, instead of a binary one like hdf5.

Is there a standard way of saving lists into hdf5 table format?

Python lists present a challenge when writing to HDF5 because they may contain different types. For example, this is a perfectly valid list: [1, 'two', 3.0]. Also, if I understand your Pandas 'column1' dataframe correctly, it may contain lists of different lengths. There is no (simple) way to represent this as an HDF5 dataset. (That's why you got the [mixed] object dtype message. The conversion of the dataframe creates an intermediate object that is written as a dataset. The dtype of the converted list data is "O" (object), and HDF5 doesn't support this type.)
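To see the dtype issue directly, here is a minimal sketch that inspects the question's dataframe. The use of `pd.api.types.infer_dtype` with `skipna=False` is my assumption about a convenient way to reproduce the `[mixed]` label from the error message; it is not necessarily the exact call pandas makes internally.

```python
import pandas as pd

test = pd.DataFrame({"column1": [list(range(0, 5))]})

# A column of Python lists is held as generic Python objects
col_dtype = test["column1"].dtype

# Ask pandas what kind of values it infers inside the object column
# (skipna=False is an assumption, chosen for illustration)
inferred = pd.api.types.infer_dtype(test["column1"], skipna=False)

print(col_dtype, inferred)
```

The column dtype is `object`, and the inferred kind is not a string type, which is what the table writer objects to.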

However, all is not lost. If we can make some assumptions about your data, we can wrangle it into an HDF5 dataset. Assumptions: 1) all df list entries are the same type (int in this case), and 2) all df lists are the same length. (We can handle lists of different lengths, but it is more complicated.) Also, you will need to use a different package to write the HDF5 data (either PyTables or h5py). PyTables is the underlying package for Pandas HDF5 support, and h5py is widely used. The choice is yours.
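Both assumptions can be checked before writing anything. The sketch below uses a hypothetical helper, `lists_to_2d` (my name, not part of pandas or NumPy), that rejects lists of unequal length and relies on NumPy to raise an error if an entry cannot be converted to the requested dtype.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(
    {"column1": [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]}
)

def lists_to_2d(column, dtype="int32"):
    """Hypothetical helper: stack equal-length lists into one 2-D array."""
    lengths = {len(item) for item in column}
    if len(lengths) != 1:
        raise ValueError(f"lists differ in length: {sorted(lengths)}")
    # np.array raises if an entry (e.g. 'two') cannot be cast to dtype
    return np.array(column.tolist(), dtype=dtype)

arr = lists_to_2d(test["column1"])
print(arr.shape, arr.dtype)
```

If the check passes, the resulting regular 2-D array is the kind of data HDF5 stores natively.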

Before I post the code, here is an outline of the process:

  1. Create a NumPy record array (aka recarray) from the dataframe.
  2. Define the desired type and shape for the HDF5 dataset (as an Atom for PyTables, or a dtype for h5py).
  3. Create the dataset with the Atom/dtype definition above (could be done on 1 line, but it is easier to read this way).
  4. Loop over rows of the recarray (from Step 1), and write data to rows of the dataset. This converts the list to the equivalent array.

Code to create recarray (adds 2 rows to your dataframe):

import pandas as pd
test = pd.DataFrame({"column1":[list(range(0,5)), list(range(10,15)), list(range(100,105))]})
# create recarray from the dataframe (index=False leaves the dataframe index out of the records)
rec_arr = test.to_records(index=False)
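As a quick check on what Step 1 produces, this sketch prints the structure of the record array. It repeats the dataframe definition so it can run on its own; the variable `first` is only there for illustration.

```python
import pandas as pd

test = pd.DataFrame(
    {"column1": [list(range(0, 5)), list(range(10, 15)), list(range(100, 105))]}
)
rec_arr = test.to_records(index=False)

# One named field per dataframe column; list cells stay as Python objects
print(rec_arr.dtype)
# One record per dataframe row
print(rec_arr.shape)

# Each record still holds the original list for that row
first = rec_arr[0]["column1"]
print(first)
```

The field is still object-typed at this point, which is why the export code below defines an explicit integer type and shape for the dataset.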

PyTables specific code to export data:

import tables as tb
with tb.open_file('74489101_tb.h5', 'w') as h5f:
    # define "atom" with type and shape of column1 data
    df_atom = tb.Atom.from_type('int32', shape=(len(rec_arr[0]['column1']),) )
    # create the dataset
    test = h5f.create_array('/','test', shape=rec_arr.shape, atom=df_atom )
    # loop over recarray and populate dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])  

h5py specific code to export data:

import h5py
with h5py.File('74489101_h5py.h5', 'w') as h5f:
    # define a dtype with the type and shape of the column1 data
    df_dt = (int,(len(rec_arr[0]['column1']),))
    # create the dataset
    test = h5f.create_dataset('test', shape=rec_arr.shape, dtype=df_dt )
    # loop over the recarray and populate the dataset
    for i in range(rec_arr.shape[0]):
        test[i] = rec_arr[i]['column1']
    print(test[:])
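For the variable-length case mentioned earlier, one possible approach is h5py's variable-length dtype. This is a sketch under my own assumptions: the `rows` data, the dataset name `ragged`, and the temporary file path are made up for illustration, and the resulting file is a plain h5py dataset, not a pandas "table" format store that `pd.read_hdf` would understand.

```python
import os
import tempfile

import h5py
import numpy as np

# Example ragged data (illustrative only, not from the question)
rows = [[0, 1, 2, 3, 4], [10, 11], [100, 101, 102]]

# Element type: a variable-length sequence of int32 values
vlen_int = h5py.vlen_dtype(np.dtype("int32"))

# Hypothetical file location in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "ragged_example.h5")

with h5py.File(path, "w") as h5f:
    ds = h5f.create_dataset("ragged", shape=(len(rows),), dtype=vlen_int)
    for i, row in enumerate(rows):
        ds[i] = np.asarray(row, dtype="int32")

# Read the data back and convert each entry to a plain list
with h5py.File(path, "r") as h5f:
    restored = [item.tolist() for item in h5f["ragged"][:]]

print(restored)
```

The entries still have to share one element type; only the length is allowed to vary.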


 