How to save h5py arrays with different sizes?
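I am referring to this question. I am starting this new thread because I did not really understand the answer given there, and I hope someone can explain it to me in more detail.
Basically, my problem is the same as in the linked question. Until now, I have used np.vstack and created an h5 file from the result. Below is my example: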
import numpy as np
import h5py
import glob
path="/home/ling/test/"
def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    stack = np.vstack((data1, data2))
    h5f = h5py.File("/home/ling/test/2test.h5", "w")
    h5f.create_dataset("test_data", data=stack)
    h5f.close()
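This works very well if all the arrays have the same size. But when the sizes differ, it throws the error TypeError: Object dtype dtype('O') has no native HDF5 equivalent
From the answer given there, I understand that I have to save the arrays as separate datasets. But looking at the example snippet given there, for k,v in adict.items() and grp.create_dataset(k,data=v), is k supposed to be the name of the dataset, like test_data in my example? And what is v?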
Below is what the vstack result, stack, looks like:
[[array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.11719, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.03906, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.11719, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.15625, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.11719, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.15625, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.11719, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.11719, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.07812, 0. ])
array([ 0.07812, 0.03906, 0.07812, ..., 0.03906, 0.07812, 0. ])
array([ 0.03906, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([-0.07812, -0.07812, -0.07812, ..., -0.07812, -0.11719, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])
array([ 0.07812, 0.07812, 0.07812, ..., 0.07812, 0.07812, 0. ])]
[ array([ 10.9375 , 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])
array([ 11.01562, 11.01562, 11.01562, ..., 11.09375, 11.09375, 1. ])
array([ 11.09375, 11.09375, 11.09375, ..., 11.09375, 11.09375, 1. ])
array([ 10.97656, 11.01562, 11.01562, ..., 11.13281, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.01562, ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.13281, 1. ])
array([ 11.05469, 11.09375, 11.09375, ..., 11.09375, 11.09375, 1. ])
array([ 11.09375, 11.05469, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 10.97656, 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])
array([ 11.09375, 11.05469, 11.09375, ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.09375, 11.17188, 1. ])
array([ 11.09375, 11.09375, 11.09375, ..., 10.97656, 11.09375, 1. ])
array([ 11.09375, 11.09375, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.01562, 11.05469, ..., 11.01562, 11.01562, 1. ])
array([ 10.78125, 10.78125, 10.78125, ..., 11.05469, 11.05469, 1. ])
array([ 11.13281, 11.09375, 11.13281, ..., 11.09375, 11.09375, 1. ])
array([ 11.13281, 11.09375, 11.09375, ..., 11.05469, 11.05469, 1. ])
array([ 10.97656, 10.97656, 10.9375 , ..., 11.05469, 11.05469, 1. ])
array([ 11.05469, 11.09375, 11.05469, ..., 11.09375, 11.09375, 1. ])
array([ 10.9375 , 10.9375 , 10.9375 , ..., 11.09375, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.05469, 11.05469, 1. ])
array([ 10.9375 , 10.89844, 10.9375 , ..., 11.05469, 11.09375, 1. ])
array([ 10.9375 , 10.97656, 10.97656, ..., 11.05469, 11.05469, 1. ])
array([ 10.89844, 10.89844, 10.89844, ..., 11.05469, 11.09375, 1. ])
array([ 11.05469, 11.05469, 11.05469, ..., 11.01562, 11.01562, 1. ])]]
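Thank you for your help and explanation.
I solved this problem by using pandas. Initially, I used Pierre de Buyl's exact suggestion, but it gave me an error when I tried to load/read the file/dataset. I tried test_data = h5f["data1/file1"][:]. This gave me an error saying Unable to open object (Object 'file1' does not exist).
I checked by reading 2test.h5 with pandas.read_hdf, and it showed that the file was empty. I searched online for other solutions and found this one. I have modified it: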
import numpy as np
import glob
import pandas as pd
path = "/home/ling/test/"
def runtest():
    data1 = [np.loadtxt(file) for file in glob.glob(path + "data1/*.csv")]
    data2 = [np.loadtxt(file) for file in glob.glob(path + "data2/*.csv")]
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    # concatenate row-wise (DataFrame.append was removed in pandas 2.0)
    combine = pd.concat([df1, df2], ignore_index=True)
    # sort the NaN to the left; result_type="expand" keeps the result a DataFrame
    combinedf = combine.apply(lambda x: sorted(x, key=pd.notnull), axis=1, result_type="expand")
    combinedf.to_hdf('/home/ling/test/2test.h5', key='twodata')
runtest()
To read it, I just use
input_data = pd.read_hdf('2test.h5', 'twodata')
read_input = input_data.values
read1 = read_input[:, -1] # read/get last column for example
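The NaN padding that this approach relies on is easier to see on a small example. The following is only a sketch with made-up rows standing in for the CSV contents (the values and the names rows, left_padded, labels and recovered are assumptions, not part of the code above). It shows the padding, the "NaN to the left" sort, and one way to strip the padding again after reading.

```python
import numpy as np
import pandas as pd

# Hypothetical rows of unequal length; the last value plays the role of a label.
rows = [[0.1, 0.2, 0.3, 0.0], [11.0, 1.0]]

df = pd.DataFrame(rows)      # shorter rows are padded with NaN on the right
values = df.to_numpy()

# Stable sort on "is not null": NaN (False) moves to the front,
# the remaining values keep their original order.
left_padded = np.array([sorted(r, key=pd.notnull) for r in values])
labels = left_padded[:, -1]  # the last column now always holds a real value

# Drop the padding to get the variable-length rows back.
recovered = [r[~np.isnan(r)] for r in left_padded]
print([len(r) for r in recovered])
```

Note that this only works as long as NaN never occurs as a genuine value in the data, since padding and real NaN entries cannot be told apart afterwards.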
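The basic elements in an HDF5 file are groups (similar to directories) and datasets (similar to arrays).
NumPy will create an array from many different kinds of input. When one attempts to create an array from disparate elements (i.e. of different lengths), NumPy returns an array of type 'O'. Look for object_ in the NumPy reference guide. At that point there is little advantage to using NumPy, as such an array resembles a standard Python list.
HDF5 cannot store arrays of type 'O' because it does not have generic datatypes (only some support for C struct type objects).
The most obvious solution to your problem is to store your data in HDF5 datasets, with "one dataset" per table. You retain the advantage of collecting the data in a single file, and you have "dict-like" access to the elements.
Try the following code: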
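To make the 'O' type concrete, here is a minimal sketch with two made-up rows (the values and variable names are assumptions for illustration). Recent NumPy versions refuse to build an array from rows of different lengths unless dtype=object is requested explicitly, so the sketch spells it out.

```python
import numpy as np

# Two hypothetical rows of different lengths, standing in for the CSV contents.
rows = [np.array([0.1, 0.2, 0.3]), np.array([1.0, 2.0])]

# Ragged input only yields an array when the object dtype is requested.
ragged = np.array(rows, dtype=object)
print(ragged.dtype, ragged.shape)    # a 1-D array whose elements are arrays

# Equal-length rows, by contrast, stack into an ordinary 2-D float array.
regular = np.vstack([np.array([0.1, 0.2]), np.array([1.0, 2.0])])
print(regular.dtype, regular.shape)
```

The ragged array is just a container of references to the individual arrays, which is why it behaves much like a Python list.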
import glob
import os
import numpy as np
import h5py
path="/home/ling/test/"
def runtest():
    with h5py.File("/home/ling/test/2test.h5", "w") as h5f:
        for group_name in ("data1", "data2"):
            grp = h5f.create_group(group_name)
            for file in glob.glob(path + group_name + "/*.csv"):
                # Name the dataset after the file only, without directory or ".csv".
                # Using file[:-4] directly would keep the full path, so the dataset
                # would end up under /home/ling/test/... instead of inside the group.
                name = os.path.basename(file)[:-4]
                grp.create_dataset(name, data=np.loadtxt(file))
To read:
h5f = h5py.File("/home/ling/test/2test.h5", "r")
test_data = h5f['data1/thefirstfilenamewithoutcsvextension'][:]
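Regarding the k and v in the snippet from the linked answer: k is the name the dataset gets, and v is the array stored under that name. The sketch below illustrates this with a made-up dictionary (adict, the dataset names and the values are assumptions) and an in-memory file, so nothing is written to disk.

```python
import h5py
import numpy as np

# Hypothetical arrays of different lengths, keyed by the desired dataset name.
adict = {"file1": np.array([0.1, 0.2, 0.3]), "file2": np.array([1.0, 2.0])}

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as h5f:
    grp = h5f.create_group("data1")
    for k, v in adict.items():
        grp.create_dataset(k, data=v)    # k: dataset name, v: array stored under it

    names = sorted(grp.keys())                 # dataset names inside the group
    shapes = {k: grp[k].shape for k in grp}    # each dataset keeps its own length
    first = h5f["data1/file1"][:]              # read one dataset back as a NumPy array

print(names, shapes)
```

Listing the keys of a group like this is also a quick way to check where the datasets actually ended up when a lookup such as h5f["data1/file1"] fails.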