Python: more efficient data structure than a nested dictionary of dictionaries of arrays?
I'm writing a Python 3.10 program that predicts time series of various properties for a large number of objects. My current choice of data structure for collecting results internally in the code, and then for writing to files, is a nested dictionary of dictionaries of arrays. For example, for two objects with time series of 3 properties:
import numpy as np
properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
              'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}
The reason I like this nested dictionary format is that it is intuitive to access: the outer key is the object name, and the inner keys are the property names. The elements corresponding to each of the inner keys are numpy arrays giving the value of some property as a function of time. My actual code generates a dict of ~100,000s of objects (outer keys), each having ~100 properties (inner keys) recorded at ~1000 times (numpy float arrays).
I have noticed that when I do np.savez('filename.npz',**properties) on my own huge properties dictionary (or subsets of it), it takes a while and the output file sizes are a few GB (probably because np.savez is calling pickle under the hood since my nested dict is not an array).
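The pickle guess can be checked directly. The following is a minimal sketch, not part of the original post: it saves a tiny stand-in for the properties dict to an in-memory buffer (an assumption made to avoid touching the disk) and tests whether the stored entry can be read back without pickle.

```python
import io
import numpy as np

# Tiny stand-in for the question's nested dict (values are arbitrary).
properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10)}}

# Save to an in-memory buffer instead of 'filename.npz'.
buf = io.BytesIO()
np.savez(buf, **properties)
raw = buf.getvalue()

# np.load defaults to allow_pickle=False, so a pickled entry refuses to load.
with np.load(io.BytesIO(raw)) as data:
    try:
        data['obj1']
        needs_pickle = False
    except ValueError:
        needs_pickle = True

# With pickle allowed, the entry comes back as a 0-d object array wrapping the dict.
with np.load(io.BytesIO(raw), allow_pickle=True) as data:
    stored = data['obj1']

print(needs_pickle, stored.dtype, stored.shape)
```

If `needs_pickle` comes out true, each inner dict was wrapped in an object array and pickled rather than written as plain numeric data.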
Is there a more efficient data structure widely applicable for my use case? Is it worth switching from my nested dict to pandas dataframes, numpy ndarrays or record arrays, or a list of some kind of Table-like objects? It would be nice to be able to save/load the file in a binary output format that preserves the mapping from object names to their dict/array/table/dataframe of properties, and of course the names of each of the property time series arrays.
Let's look at your obj2 value, a dict:
In [307]: dd={'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}
In [308]: dd
Out[308]:
{'time': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
'x': array([-0.48197915, 0.15597792, 0.44113401, 1.38062753, -1.21273378,
-1.27120008, 1.53072667, 1.9799255 , 0.13647925, -1.37056793,
-2.06470784, 0.92314969, 0.30885371, 0.64860014, 1.30273519]),
'vx': array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
1.86391024, 1.006901 , -0.16168439, 1.5180135 , -1.16436363,
-0.20254291, -1.60280149, -1.91749387, 0.25366602, -1.61993012])}
It's easy to make a dataframe from that:
In [309]: df = pd.DataFrame(dd)
In [310]: df
Out[310]:
time x vx
0 0 -0.481979 -1.602281
1 1 0.155978 -1.491630
2 2 0.441134 -1.170610
3 3 1.380628 -0.092675
4 4 -1.212734 -0.941331
5 5 -1.271200 1.863910
6 6 1.530727 1.006901
7 7 1.979926 -0.161684
8 8 0.136479 1.518014
9 9 -1.370568 -1.164364
10 10 -2.064708 -0.202543
11 11 0.923150 -1.602801
12 12 0.308854 -1.917494
13 13 0.648600 0.253666
14 14 1.302735 -1.619930
We could also make a structured array from that frame. I could also make the array directly from your dict, defining the same compound dtype. But since I already have the frame, I'll go this route. The distinction between structured array and recarray is minor.
In [312]: arr = df.to_records()
In [313]: arr
Out[313]:
rec.array([( 0, 0, -0.48197915, -1.60228105),
( 1, 1, 0.15597792, -1.49163002),
( 2, 2, 0.44113401, -1.17061046),
( 3, 3, 1.38062753, -0.09267467),
( 4, 4, -1.21273378, -0.94133092),
( 5, 5, -1.27120008, 1.86391024),
( 6, 6, 1.53072667, 1.006901 ),
( 7, 7, 1.9799255 , -0.16168439),
( 8, 8, 0.13647925, 1.5180135 ),
( 9, 9, -1.37056793, -1.16436363),
(10, 10, -2.06470784, -0.20254291),
(11, 11, 0.92314969, -1.60280149),
(12, 12, 0.30885371, -1.91749387),
(13, 13, 0.64860014, 0.25366602),
(14, 14, 1.30273519, -1.61993012)],
dtype=[('index', '<i8'), ('time', '<i4'), ('x', '<f8'), ('vx', '<f8')])
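For comparison, the direct route from the dict could look roughly like the sketch below. It is an illustration rather than the code used in this answer: the names `dt` and `arr2` are made up here, the values are freshly generated so the numbers differ, and the integer width of the `time` field depends on the platform.

```python
import numpy as np

# Same layout as the question's 'obj2' entry.
dd = {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}

# Compound dtype: one field per property, reusing each array's own dtype.
dt = np.dtype([(name, col.dtype) for name, col in dd.items()])

# Allocate the structured array, then fill it field by field.
arr2 = np.empty(len(dd['time']), dtype=dt)
for name, col in dd.items():
    arr2[name] = col

print(arr2.dtype.names)
print(arr2['x'][:3])  # a field is accessed by name, like a dict key
```

Unlike the `to_records()` result, this array has no `index` field, since nothing here adds one.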
Now let's compare the pickle strings:
In [314]: import pickle
In [315]: len(pickle.dumps(dd))
Out[315]: 561
In [316]: len(pickle.dumps(df)) # df.to_pickle makes a 1079 byte file
Out[316]: 1052
In [317]: len(pickle.dumps(arr)) # arr.nbytes is 420
Out[317]: 738 # np.save writes a 612 byte file
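With only 15 samples per array, fixed per-object overhead makes up a large share of these sizes. A rough sketch of the same measurement at the question's stated length of about 1000 samples follows; the variable names are illustrative, and no particular byte counts are claimed because they vary with platform and pickle protocol.

```python
import pickle
import numpy as np

n = 1000  # samples per property, as stated in the question
dd = {'time': np.arange(n), 'x': np.random.randn(n), 'vx': np.random.randn(n)}

raw_bytes = sum(a.nbytes for a in dd.values())  # payload only
pickled_bytes = len(pickle.dumps(dd))           # payload plus pickle framing
overhead = pickled_bytes - raw_bytes

print(raw_bytes, pickled_bytes, overhead)
```

At this length the overhead should be a small fraction of the payload, which suggests that file sizes at that scale are likely dominated by the float data itself rather than by the choice of container.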
And another encoding, a list:
In [318]: alist = list(dd.items())
In [319]: alist
Out[319]:
[('time', array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])),
('x',
array([-0.48197915, 0.15597792, 0.44113401, 1.38062753, -1.21273378,
-1.27120008, 1.53072667, 1.9799255 , 0.13647925, -1.37056793,
-2.06470784, 0.92314969, 0.30885371, 0.64860014, 1.30273519])),
('vx',
array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
1.86391024, 1.006901 , -0.16168439, 1.5180135 , -1.16436363,
-0.20254291, -1.60280149, -1.91749387, 0.25366602, -1.61993012]))]
In [320]: len(pickle.dumps(alist))
Out[320]: 567
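One way to combine these pieces for the save/load requirement is sketched below: pack each object's properties into one record array and store them with np.savez under the object name. This is a sketch under stated assumptions, not a benchmarked recommendation: it assumes all properties of an object have the same length, and it uses an in-memory buffer in place of a real file.

```python
import io
import numpy as np

properties = {
    'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
    'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)},
}

# One record array per object; field names carry the property names.
packed = {
    obj: np.rec.fromarrays(list(props.values()), names=list(props.keys()))
    for obj, props in properties.items()
}

buf = io.BytesIO()  # stands in for a file such as 'filename.npz'
np.savez(buf, **packed)
buf.seek(0)

# Numeric structured arrays load with the default allow_pickle=False.
with np.load(buf) as data:
    restored = {obj: data[obj] for obj in data.files}

print(restored['obj2'].dtype.names)
print(restored['obj1']['x'][:3])
```

Both levels of naming survive the round trip: the archive keys are the object names, and the dtype field names are the property names. Whether this is faster or smaller than the nested dict for hundreds of thousands of objects would need to be measured.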