pandas: write HDF with Python 2.7, read with Python 3.7

Question

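I have a DataFrame whose values include arrays. I wrote it to HDF in Python 2.7 and want to read it in Python 3.7, but I get a UnicodeDecodeError. I followed this issue, https://github.com/pandas-dev/pandas/issues/17540 , but could not apply the suggested solutions:

Saving with format='table' is not possible for object dtype columns.
Saving with encoding='utf-8' works, but reading the HDF file raises: TypeError: lookup() argument must be str, not numpy.bytes_

Is there any other solution?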

Answer 1

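If you save the numpy arrays under separate keys, you can use format='table' for the remaining columns. Here is an example of how to do this.

First, create a sample dataset. I only tested this with 1-D and 2-D arrays, but it should work for any number of dimensions. I also import h5py so that I can read and write the HDF5 file directly.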

import pandas as pd
import numpy as np
import h5py

df = pd.DataFrame({
    'time': [0, 1, 2],
    'signal': [np.array([1, 2, 3], dtype='int'),
               np.array([2, 0, 1], dtype='int'),
               np.array([3, 3, 4], dtype='int')],
    'signal2': [np.array([2, 3, 4], dtype='int'),
                np.array([1, 1, 1], dtype='int'),
                np.array([2, 2, 2], dtype='int')],
})

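Next, define the save function. It visits every cell in the columns that cannot be serialized properly and stores each cell under its own key. (Note: this assumes the index is unique.)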

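def save(df, filename, cols_to_save_separately):
    # Write each array cell as its own HDF5 dataset, keyed by column name and index label.
    with h5py.File(filename, "w") as f:
        for col in cols_to_save_separately:
            # .items() replaces .iteritems(), which newer pandas versions no longer provide
            for i, array in df[col].items():
                # str.format instead of an f-string, so this also runs on Python 2.7
                dataset_key = "{}_{}".format(col, i)
                f.create_dataset(dataset_key, data=array)
    # Drop the array columns, then append the remaining columns in table format.
    df = df.drop(cols_to_save_separately, axis=1)
    df.to_hdf(filename, key='tab', mode='a', format='table',
              data_columns=['time'])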
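
The unique-index assumption matters because each dataset key is built from the column name and the index label, so two rows with the same label would map to the same key. Below is a minimal sketch of a pre-save check. The helper names dataset_key and find_key_collisions are hypothetical and not part of pandas or h5py; the key format simply mirrors the "column_index" scheme used in the save function.

```python
def dataset_key(col, index_label):
    # Same "<column>_<index label>" scheme as in the save function (assumed).
    return "{}_{}".format(col, index_label)


def find_key_collisions(columns, index_labels):
    """Return every dataset key that would be generated more than once."""
    seen = set()
    collisions = []
    for col in columns:
        for label in index_labels:
            key = dataset_key(col, label)
            if key in seen:
                collisions.append(key)
            seen.add(key)
    return collisions


# Unique index labels: no key is produced twice.
print(find_key_collisions(['signal', 'signal2'], [0, 1, 2]))  # []
# Duplicate label 1: 'signal_1' would be written twice.
print(find_key_collisions(['signal'], [0, 1, 1]))  # ['signal_1']
```

If the check returns a non-empty list, calling df.reset_index(drop=True) before saving is one way to get unique labels, at the cost of losing the original index.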

To load the file, do the reverse.

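def read(filename, cols_to_save_separately):
    new_df = pd.read_hdf(filename, key='tab')
    with h5py.File(filename, "r") as f:
        for col in cols_to_save_separately:
            # Pass the index explicitly so the arrays line up with the rows
            # even when the index is not 0..n-1.
            new_df[col] = pd.Series([f[f"{col}_{i}"][:] for i in new_df.index],
                                    index=new_df.index)
    return new_df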

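Here is an example of how to use this code. The columns signal and signal2 are saved separately because they contain arrays. The time column can be left as it is.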

print("Prior to save")
print(df)
save(df, 'events.h5', cols_to_save_separately=['signal', 'signal2'])
print("After save")
print(read('events.h5', cols_to_save_separately=['signal', 'signal2']))

This produces the following output:

Prior to save
   time     signal    signal2
0     0  [1, 2, 3]  [2, 3, 4]
1     1  [2, 0, 1]  [1, 1, 1]
2     2  [3, 3, 4]  [2, 2, 2]
After save
   time     signal    signal2
0     0  [1, 2, 3]  [2, 3, 4]
1     1  [2, 0, 1]  [1, 1, 1]
2     2  [3, 3, 4]  [2, 2, 2]
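
The two printouts look identical, but printing hides dtypes and array shapes. A stricter check could look like the sketch below. The helper name frames_match is hypothetical, and the example compares small in-memory frames instead of an actual HDF5 round trip, so it runs without h5py or PyTables.

```python
import numpy as np
import pandas as pd


def frames_match(left, right, array_cols):
    """Return True if both frames hold the same data; array cells are compared element-wise."""
    if not left.index.equals(right.index):
        return False
    if set(left.columns) != set(right.columns):
        return False
    for col in left.columns:
        if col in array_cols:
            # Object columns holding arrays need a cell-by-cell comparison.
            for a, b in zip(left[col], right[col]):
                if not np.array_equal(a, b):
                    return False
        elif not left[col].equals(right[col]):
            return False
    return True


original = pd.DataFrame({'time': [0, 1],
                         'signal': [np.array([1, 2, 3]), np.array([2, 0, 1])]})
same = pd.DataFrame({'time': [0, 1],
                     'signal': [np.array([1, 2, 3]), np.array([2, 0, 1])]})
changed = pd.DataFrame({'time': [0, 1],
                        'signal': [np.array([1, 2, 3]), np.array([9, 9, 9])]})

print(frames_match(original, same, ['signal']))     # True
print(frames_match(original, changed, ['signal']))  # False
```

With the save and read functions above, the same helper could be called as frames_match(df, read('events.h5', ['signal', 'signal2']), ['signal', 'signal2']).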
