Saving a list of dictionaries containing numpy arrays

Question

I have a dataset structured as follows:

dataset = [{"sample":[numpy array (2048,3) shape], "category":"Cat"}, ....]

Each element of the list is a dictionary. The "sample" key holds a numpy array of shape (2048,3), and the "category" key holds the class of that sample. The dataset has 8000 elements.

I tried to save it as JSON, but it said it can't serialize numpy arrays.

What's the best way to save this list? I can't use np.save("file", dataset) because the elements are dictionaries, and I can't use JSON because of the numpy arrays. Should I use HDF5? What format should I use if the dataset is going to be used for machine learning? Thanks!

Answer 1

Creating an example specific to your data requires more details about the dictionaries in the list. I created an example that assumes every dictionary has:

A unique value for the category key. The value is used as the dataset name.
A sample key holding the array you want to save.
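
The first assumption matters because dataset names must be unique within an HDF5 group, so h5py raises an error if the same name is created twice. With 8000 samples the categories may well repeat. A minimal sketch of a check you could run first (the helper name `duplicated_categories` and the placeholder records are hypothetical, for illustration only):

```python
from collections import Counter

def duplicated_categories(dataset):
    """Return the category values that appear more than once."""
    counts = Counter(d["category"] for d in dataset)
    return sorted(name for name, n in counts.items() if n > 1)

# Illustrative records; the "sample" values are placeholders here.
records = [
    {"sample": None, "category": "Cat"},
    {"sample": None, "category": "Dog"},
    {"sample": None, "category": "Cat"},
]
print(duplicated_categories(records))
```

If this returns a non-empty list, the category alone cannot serve as the dataset name and a different naming or storage layout is needed.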

The code below creates some data, writes it to an HDF5 file with the h5py package, then reads the data back into a new list of dictionaries. It is a good starting point for your problem.

import numpy as np
import h5py

a0, a1 = 10, 5
arr1 = np.arange(a0*a1).reshape(a0,a1)
arr2 = np.arange(a0*a1,2*a0*a1).reshape(a0,a1)
arr3 = np.arange(2*a0*a1,3*a0*a1).reshape(a0,a1)

dataset = [{"sample":arr1, "category":"Cat"}, 
           {"sample":arr2, "category":"Dog"},
           {"sample":arr3, "category":"Fish"},
           ]

# Create the HDF5 file with "category" as dataset name and "sample" as the data
with h5py.File('SO_73499414.h5', 'w') as h5f:
    for ds_dict in dataset:
        h5f.create_dataset(ds_dict["category"], data=ds_dict["sample"])

# Retrieve the HDF5 data with "category" as dataset name and "sample" as the data
ds_list = []
with h5py.File('SO_73499414.h5', 'r') as h5f:
    for ds_name in h5f:
        print(ds_name,'\n',h5f[ds_name]) # prints name and dataset attributes
        print(h5f[ds_name][()]) # prints the dataset values (as an array) 
        # add data and name to list
        ds_list.append({"sample":h5f[ds_name][()], "category":ds_name})
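
If categories repeat, one alternative is to stack all samples into a single array and keep the labels in a parallel array. The sketch below uses NumPy's `.npz` format rather than HDF5, so it is a swapped-in technique and not the approach from the answer above. The file name `dataset.npz` and the four-sample stand-in data are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Small stand-in for the real data (8000 samples of shape (2048, 3)).
rng = np.random.default_rng(0)
dataset = [
    {"sample": rng.random((2048, 3)), "category": name}
    for name in ["Cat", "Dog", "Cat", "Fish"]
]

# Stack samples into one (N, 2048, 3) array; labels go in a parallel array.
samples = np.stack([d["sample"] for d in dataset])
categories = np.array([d["category"] for d in dataset])

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "dataset.npz")
np.savez_compressed(path, samples=samples, categories=categories)

# Load the arrays and rebuild the list of dictionaries.
with np.load(path) as data:
    restored = [
        {"sample": s, "category": str(c)}
        for s, c in zip(data["samples"], data["categories"])
    ]
```

Keeping the samples as one (N, 2048, 3) array with a matching label array is often a convenient layout for machine learning code, since both can be indexed or batched together.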
