Saving dictionary of numpy arrays

So I have a DB with a couple of years' worth of site data. I am now attempting to use that data for analytics - plotting and sorting of advertising costs by keyword, etc.

One of the data grabs from the DB takes minutes to complete. While I could spend some time optimizing the SQL statements I use to get the data, I'd prefer to simply leave that class and its SQL alone, grab the data, and save the results to a data file for faster retrieval later. Most of this DB data isn't going to change, so I could write a separate Python script to update the file every 24 hours and then use that file for this long-running task.

The data is being returned as a dictionary of numpy arrays. When I use numpy.save('data', data) the file is saved just fine. When I use data2 = numpy.load('data.npy') it loads the file without error. However, the output data2 does not equal the original data.

Specifically, the line data == data2 returns False. Additionally, if I use the following:

for key, key_data in data.items():
    print(key)

it works. But when I replace data.items() with data2.items() I get an error:

AttributeError: 'numpy.ndarray' object has no attribute 'items'

Using type(data) I get dict. Using type(data2) I get numpy.ndarray.

So how do I fix this? I want the loaded data to equal the data I passed in for saving. Is there an argument to numpy.save to fix this, or do I need some form of simple reformatting function to turn the loaded data back into the proper structure?

Attempts to get into the ndarray via for loops or indexing all lead to errors about indexing a 0-d array. Casting with dict(data2) also fails, with an error about iterating over a 0-d array. However, Spyder shows the value of the array, and it includes the data I saved. I just can't figure out how to get to it.

If I need to reformat the loaded data, I'd appreciate some example code on how to do this.

Let's look at a small example:

In [819]: N
Out[819]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [820]: data={'N':N}

In [821]: np.save('temp.npy',data)

In [822]: data2=np.load('temp.npy')

In [823]: data2
Out[823]: 
array({'N': array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])}, dtype=object)

np.save is designed to save numpy arrays. data is a dictionary. So it wrapped it in an object array, and used pickle to save that object. Your data2 probably has the same character.
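The wrapping step can be reproduced without touching the disk. This is a minimal sketch, assuming np.save converts its argument with np.asanyarray before writing; the variable name wrapped is only illustrative:

```python
import numpy as np

N = np.arange(12, dtype=float).reshape(3, 4)
data = {'N': N}

# A dict is not a sequence, so NumPy treats it as a single scalar object:
# the result is a 0-d array of dtype object holding a reference to the dict.
wrapped = np.asanyarray(data)

print(wrapped.shape)           # ()
print(wrapped.dtype)           # object
print(wrapped.item() is data)  # True
```

A 0-d array has no axis to iterate over or index into, which is why for loops, integer indexing and dict(data2) all fail on the loaded object.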

You get at the array with:

In [826]: data2[()]['N']
Out[826]: 
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
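Put together as a script, the round trip might look like the sketch below. Note that recent NumPy versions refuse to load pickled object arrays unless allow_pickle=True is passed to np.load. The temporary directory is just an assumption to keep the example self-contained:

```python
import os
import tempfile

import numpy as np

data = {'N': np.arange(12, dtype=float).reshape(3, 4)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'temp.npy')
    np.save(path, data)  # the dict is pickled inside a 0-d object array
    # Unpickling has to be enabled explicitly when loading.
    loaded = np.load(path, allow_pickle=True)

data2 = loaded[()]  # index with an empty tuple to unwrap the dict
print(type(data2))                            # <class 'dict'>
print(np.array_equal(data2['N'], data['N']))  # True
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.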

I really liked deepdish (it saves them in HDF5 format):

>>> import deepdish as dd
>>> d = {'foo': np.arange(10), 'bar': np.ones((5, 4, 3))}
>>> dd.io.save('test.h5', d)

$ ddls test.h5
/bar                       array (5, 4, 3) [float64]
/foo                       array (10,) [int64]

>>> d = dd.io.load('test.h5')

In my experience, though, it seems to be partially broken for large datasets :(
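If an extra dependency is a concern, a NumPy-only option that also stores each array under its own name is np.savez. This is a different technique from deepdish, shown here only as a sketch; it assumes the dictionary is flat, that every key is a string usable as a keyword argument, and that every value is an array:

```python
import os
import tempfile

import numpy as np

d = {'foo': np.arange(10), 'bar': np.ones((5, 4, 3))}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.npz')
    np.savez(path, **d)  # each array is stored under its dict key
    with np.load(path) as npz:
        # Read every array out before the archive is closed.
        d2 = {key: npz[key] for key in npz.files}

print(sorted(d2))       # ['bar', 'foo']
print(d2['bar'].shape)  # (5, 4, 3)
```

Because the arrays are stored as plain .npy members of the archive, no pickling is involved and the result is an ordinary dict again.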

When saving a dictionary with numpy, the dictionary is encoded into an array. To get what you need, you can do as in this example:

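import numpy as np

my_dict = {'a': np.array(range(3)), 'b': np.array(range(4))}

np.save('my_dict.npy', my_dict)

# Recent NumPy versions need allow_pickle=True to load object arrays
my_dict_back = np.load('my_dict.npy', allow_pickle=True)

# .item() unwraps the 0-d object array and returns the original dict
print(my_dict_back.item().keys())
print(my_dict_back.item().get('a'))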

So you are probably missing .item() for the reloaded dictionary. Check this out:

for key, key_d in data2.item().items():
    print(key, key_d)

The comparison my_dict == my_dict_back.item() works only for dictionaries that do not have lists or arrays in their values.
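For dictionaries whose values are arrays, one way to check the round trip is to compare key by key with np.array_equal. A minimal sketch; dicts_of_arrays_equal is a hypothetical helper, not part of NumPy, and it assumes every value is an array:

```python
import numpy as np


def dicts_of_arrays_equal(a, b):
    """Return True if both dicts have the same keys and equal arrays."""
    if a.keys() != b.keys():
        return False
    return all(np.array_equal(a[k], b[k]) for k in a)


my_dict = {'a': np.arange(3), 'b': np.arange(4)}
same = {'a': np.arange(3), 'b': np.arange(4)}
different = {'a': np.arange(3), 'b': np.zeros(4)}

print(dicts_of_arrays_equal(my_dict, same))       # True
print(dicts_of_arrays_equal(my_dict, different))  # False
```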


EDIT: for the item() issue mentioned above, I think it is a better option to save dictionaries with the pickle library rather than with numpy.
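A rough sketch of the pickle route, assuming the same my_dict as above; the file name my_dict.pkl and the temporary directory are only illustrative:

```python
import os
import pickle
import tempfile

import numpy as np

my_dict = {'a': np.arange(3), 'b': np.arange(4)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my_dict.pkl')
    # Pickle files must be opened in binary mode.
    with open(path, 'wb') as f:
        pickle.dump(my_dict, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as f:
        my_dict_back = pickle.load(f)

print(type(my_dict_back))                               # <class 'dict'>
print(np.array_equal(my_dict_back['a'], my_dict['a']))  # True
```

Here the loaded object is a dict straight away, so no .item() call is needed. As with any pickle, only load files you trust.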
