
Python: more efficient data structure than a nested dictionary of dictionaries of arrays?

I'm writing a Python 3.10 program that predicts time series of various properties for a large number of objects. My current choice of data structure for collecting results internally and then writing them to files is a nested dictionary of dictionaries of arrays. For example, for two objects with time series of 3 properties:

import numpy as np

properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
              'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

The reason I like this nested dictionary format is that it is intuitive to access: the outer key is the object name, and the inner keys are the property names. The elements corresponding to each of the inner keys are numpy arrays giving the value of some property as a function of time. My actual code generates a dict of hundreds of thousands of objects (outer keys), each having ~100 properties (inner keys) recorded at ~1000 times (numpy float arrays).

I have noticed that when I do np.savez('filename.npz',**properties) on my own huge properties dictionary (or subsets of it), it takes a while and the output files are a few GB in size (probably because np.savez is calling pickle under the hood, since my nested dict is not an array).
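A minimal way to check that guess, as a sketch assuming NumPy's default settings: each inner dict passed to np.savez is wrapped in a 0-d object array, and object arrays can only be stored through pickle. The small properties dict and the in-memory buffer are illustrative stand-ins for the real data and file.

```python
import io
import numpy as np

# Illustrative stand-in for the real (much larger) nested dict.
properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10)}}

buf = io.BytesIO()              # in-memory stand-in for 'filename.npz'
np.savez(buf, **properties)     # each inner dict becomes one archive entry
buf.seek(0)

# Reading the entry back requires allow_pickle=True, which shows
# that it was stored as a pickled object array.
with np.load(buf, allow_pickle=True) as npz:
    stored = npz['obj1']

print(stored.dtype, stored.shape)   # object dtype, 0-d array
inner = stored.item()               # recover the original inner dict
print(type(inner), list(inner))
```

Loading the same entry with allow_pickle=False raises an error, which is another sign that the inner dicts are not being written as plain numeric arrays.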

Is there a more efficient data structure widely applicable for my use case? Is it worth switching from my nested dict to pandas dataframes, numpy ndarrays or record arrays, or a list of some kind of Table-like objects? It would be nice to be able to save/load the file in a binary output format that preserves the mapping from object names to their dict/array/table/dataframe of properties, and of course the names of each of the property time series arrays.

Let's look at your obj2 value, a dict:

In [307]: dd={'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}

In [308]: dd
Out[308]: 
{'time': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
 'x': array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
        -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
        -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519]),
 'vx': array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
         1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
        -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012])}

It's easy to make a dataframe from that:

In [309]: df = pd.DataFrame(dd)

In [310]: df
Out[310]: 
    time         x        vx
0      0 -0.481979 -1.602281
1      1  0.155978 -1.491630
2      2  0.441134 -1.170610
3      3  1.380628 -0.092675
4      4 -1.212734 -0.941331
5      5 -1.271200  1.863910
6      6  1.530727  1.006901
7      7  1.979926 -0.161684
8      8  0.136479  1.518014
9      9 -1.370568 -1.164364
10    10 -2.064708 -0.202543
11    11  0.923150 -1.602801
12    12  0.308854 -1.917494
13    13  0.648600  0.253666
14    14  1.302735 -1.619930

We could also make a structured array from that frame. I could also make the array directly from your dict, defining the same compound dtype. But since I already have the frame, I'll go this route. The distinction between a structured array and a recarray is minor.

In [312]: arr = df.to_records()

In [313]: arr
Out[313]: 
rec.array([( 0,  0, -0.48197915, -1.60228105),
           ( 1,  1,  0.15597792, -1.49163002),
           ( 2,  2,  0.44113401, -1.17061046),
           ( 3,  3,  1.38062753, -0.09267467),
           ( 4,  4, -1.21273378, -0.94133092),
           ( 5,  5, -1.27120008,  1.86391024),
           ( 6,  6,  1.53072667,  1.006901  ),
           ( 7,  7,  1.9799255 , -0.16168439),
           ( 8,  8,  0.13647925,  1.5180135 ),
           ( 9,  9, -1.37056793, -1.16436363),
           (10, 10, -2.06470784, -0.20254291),
           (11, 11,  0.92314969, -1.60280149),
           (12, 12,  0.30885371, -1.91749387),
           (13, 13,  0.64860014,  0.25366602),
           (14, 14,  1.30273519, -1.61993012)],
          dtype=[('index', '<i8'), ('time', '<i4'), ('x', '<f8'), ('vx', '<f8')])
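For comparison, the direct route from the dict (skipping the frame) might look like the sketch below. It assumes all arrays in the dict have the same length; the names dt, arr2 and rec are illustrative. Unlike df.to_records(), this version has no 'index' field.

```python
import numpy as np

# Same layout as the obj2 dict above (random values, so numbers will differ).
dd = {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}

# Compound dtype: one field per dict key, keeping each array's own dtype.
dt = np.dtype([(name, col.dtype) for name, col in dd.items()])

# Allocate the structured array, then fill it field by field.
arr2 = np.empty(len(dd['time']), dtype=dt)
for name, col in dd.items():
    arr2[name] = col

# Optional recarray view for attribute access (rec.x); the data is shared.
rec = arr2.view(np.recarray)
print(arr2.dtype)
print(rec.x[:3])
```

Fields are read as arr2['x'] on the structured array or rec.x on the recarray view, which is the minor distinction mentioned above.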

Now let's compare the pickle strings:现在让我们比较泡菜字符串:

In [314]: import pickle

In [315]: len(pickle.dumps(dd))
Out[315]: 561

In [316]: len(pickle.dumps(df))      # df.to_pickle makes a 1079 byte file
Out[316]: 1052

In [317]: len(pickle.dumps(arr))     # arr.nbytes is 420
Out[317]: 738                        # np.save writes a 612 byte file
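The numbers above can be reproduced with a short script along these lines. It is a sketch: the helper npy_size is a hypothetical name, the byte counts depend on the platform's default integer size and on library versions, and np.save is measured in memory rather than on disk.

```python
import io
import pickle
import numpy as np
import pandas as pd

dd = {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}
df = pd.DataFrame(dd)
arr = df.to_records()

def npy_size(a):
    """Bytes np.save would write for `a` (hypothetical helper, in-memory)."""
    buf = io.BytesIO()
    np.save(buf, a)
    return buf.getbuffer().nbytes

sizes = {
    'raw array data (nbytes)': arr.nbytes,
    'np.save of recarray': npy_size(arr),
    'pickle of dict': len(pickle.dumps(dd)),
    'pickle of recarray': len(pickle.dumps(arr)),
    'pickle of DataFrame': len(pickle.dumps(df)),
}
for label, n in sizes.items():
    print(f"{label:>24}: {n} bytes")
```

For arrays this small the fixed overhead (headers, dtype descriptions, class metadata) accounts for a large share of each total, so the ratios should not be extrapolated directly to arrays with ~1000 samples.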

And another encoding, a list:

In [318]: alist = list(dd.items())
In [319]: alist
Out[319]: 
[('time', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
 ('x',
  array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
         -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
         -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519])),
 ('vx',
  array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
          1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
         -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012]))]
In [320]: len(pickle.dumps(alist))
Out[320]: 567
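If the goal is to keep the object-to-property mapping in an .npz file without going through pickle, one possible approach is to flatten the nested dict into composite keys so that every entry is a plain numeric array. This is a sketch, not something measured above: the separator SEP and the helpers flatten and unflatten are hypothetical names, and it assumes the separator never appears in an object name.

```python
import io
import numpy as np

properties = {
    'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
    'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)},
}

SEP = '__'  # assumed never to occur inside an object name

def flatten(nested):
    """Map {'obj': {'prop': array}} to {'obj__prop': array}."""
    return {f"{obj}{SEP}{prop}": arr
            for obj, props in nested.items()
            for prop, arr in props.items()}

def unflatten(npz):
    """Rebuild the nested dict from an opened .npz archive."""
    nested = {}
    for key in npz.files:
        obj, prop = key.split(SEP, 1)
        nested.setdefault(obj, {})[prop] = npz[key]
    return nested

buf = io.BytesIO()                       # in-memory stand-in for a filename
np.savez_compressed(buf, **flatten(properties))
buf.seek(0)

# allow_pickle=False confirms that no pickled objects are involved.
with np.load(buf, allow_pickle=False) as npz:
    restored = unflatten(npz)

print(sorted(restored), sorted(restored['obj2']))
```

This writes one archive entry per array, so with hundreds of thousands of objects it is better read as an illustration of avoiding pickle than as a tuned solution.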


