Most efficient way to retrieve data from two lists nested as values of a dictionary in Python

I have a dictionary of coordinates from structures in three-dimensional space, laid out like this:

    struc_dict = {
        'struc1': [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
        'struc2': [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
        'struc3': [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
        'struc4': [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
    }

As an example:

struc_dict = { 
    'struc1' : [[-31.447,  -4.428, -28.285], [-32.558,  -2.108, -29.213], [-31.656,  -4.071, -30.89 ], [-33.899,  -4.504, -29.349]],
    'struc2' : [[-27.487, -15.05,  -31.418], [-29.178, -14.63,  -33.498], [-29.548, -16.754, -31.937], [-30.028, -14.278, -30.977]], 
    'struc3' : [[-16.07,   -2.042, -29.853], [-16.734,  -4.162, -29.905], [-16.279,  -4.438, -28.936], [-16.544,  -4.098, -31.514]]} 

And I would like to find the shortest distance between each pair of structures. So I would like to go through the dictionary, grab a pair of values, and calculate the shortest distance.

My current code looks like this, but it's not very pretty or efficient:

import numpy as np
for s1 in struc_dict.keys():
    for s2 in struc_dict.keys():
        # only consider distances between two structures
        if s1 == s2:
            continue
        else:
            # defining an arbitrary max value, necessary for the first comparison?
            min_dist = 10000
            for c1 in struc_dict[s1]:
                for c2 in struc_dict[s2]:
                    # calculates the distance between the two coordinates
                    if np.linalg.norm(np.array(c1)-np.array(c2)) <= min_dist:
                        min_dist = np.linalg.norm(np.array(c1)-np.array(c2))

            print("Min dist between {s1} & {s2} : {min:.3f} units".format(s1=s1, s2=s2, min=min_dist))

Output for the example:

Min dist between struc1 & struc2 : 10.309 units
Min dist between struc1 & struc3 : 14.804 units
Min dist between struc2 & struc1 : 10.309 units
Min dist between struc2 & struc3 : 15.377 units
Min dist between struc3 & struc1 : 14.804 units
Min dist between struc3 & struc2 : 15.377 units

This code works, but it calculates the distance between each pair of structures twice, since it goes through the dictionary twice. Also, I need a large min_dist start value for the first comparison of each pair of structures; is there a way around that?

In general, there must be a more elegant solution for that. Thanks!

As for a more elegant solution, consider itertools.product. Consider the following simple example:

import itertools
points = {'A': (1,1), 'B': (2,2), 'C': (3,3)}
def dist(a, b):
    return ((a[0]-b[0])**2+(a[1]-b[1])**2)**0.5
for p1, p2 in itertools.product(points.keys(), repeat=2):
    print('Distance between',p1,'and',p2,'is',dist(points[p1],points[p2]))

Output:

Distance between A and A is 0.0
Distance between A and B is 1.4142135623730951
Distance between A and C is 2.8284271247461903
Distance between B and A is 1.4142135623730951
Distance between B and B is 0.0
Distance between B and C is 1.4142135623730951
Distance between C and A is 2.8284271247461903
Distance between C and B is 1.4142135623730951
Distance between C and C is 0.0

This avoids one level of nesting compared with a for loop inside another for loop. Note that it still yields every ordered pair, including each point paired with itself.
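If the goal is to compute each pair of structures only once and to drop the arbitrary min_dist start value, one option is itertools.combinations for the outer loop plus the built-in min over a generator. This is only a sketch: the helper name min_pair_distances is made up for illustration, and math.dist assumes Python 3.8 or newer.

```python
import itertools
import math

struc_dict = {
    'struc1': [[-31.447, -4.428, -28.285], [-32.558, -2.108, -29.213],
               [-31.656, -4.071, -30.89], [-33.899, -4.504, -29.349]],
    'struc2': [[-27.487, -15.05, -31.418], [-29.178, -14.63, -33.498],
               [-29.548, -16.754, -31.937], [-30.028, -14.278, -30.977]],
    'struc3': [[-16.07, -2.042, -29.853], [-16.734, -4.162, -29.905],
               [-16.279, -4.438, -28.936], [-16.544, -4.098, -31.514]],
}

def min_pair_distances(structures):
    """Return {(name1, name2): shortest distance} for each unordered pair.

    Hypothetical helper, not part of the original posts.
    """
    result = {}
    # combinations yields each unordered pair of keys exactly once
    for s1, s2 in itertools.combinations(structures, 2):
        # min() over a generator needs no artificial starting value
        result[(s1, s2)] = min(
            math.dist(c1, c2)
            for c1, c2 in itertools.product(structures[s1], structures[s2])
        )
    return result

min_dists = min_pair_distances(struc_dict)
for (s1, s2), d in min_dists.items():
    print("Min dist between {} & {} : {:.3f} units".format(s1, s2, d))
```

This still visits every coordinate pair in pure Python, so it tidies the loops rather than making them fast.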

From your first post it was not clear whether the number of coordinates is the same across structures, so I assumed it was not.

Here is a slightly revised version of your naive approach and a first improved version exploiting the fast low-level vectorization of NumPy.

import numpy as np

def naive(data):
  res = np.inf
  for k1, v1 in data.items():
    for k2, v2 in data.items():
      if k1 == k2:
        continue
      for c1 in v1:
        for c2 in v2:
          res = np.minimum(res, np.sum((c1 - c2)**2))
  return np.sqrt(res)

def version1(data):
  res = np.inf
  for k1, v1 in data.items():
    for k2, v2 in data.items():
      if k1 == k2:
        continue
      res = np.minimum(res, np.min(np.sum((v1[None, ...] - v2[:, None, :])**2, axis=-1)))
  return np.sqrt(res)

The crucial point is v1[None, ...] - v2[:, None, :] where, by adding an extra axis to each structure in a different position, we exploit NumPy broadcasting to remove the two inner loops.
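To see what that expression does, here is a small sketch on made-up arrays (the values of v1 and v2 below are invented for illustration and are not taken from the question). It prints the intermediate shapes so the broadcasting step is visible.

```python
import numpy as np

# Two tiny "structures" with different numbers of points (illustrative values)
v1 = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0]])          # shape (2, 3)
v2 = np.array([[0.0, 3.0, 4.0],
               [1.0, 1.0, 0.0],
               [5.0, 5.0, 5.0]])          # shape (3, 3)

# (1, 2, 3) - (3, 1, 3) broadcasts to (3, 2, 3):
# diff[i, j] is v1[j] - v2[i] for every combination of points
diff = v1[None, ...] - v2[:, None, :]
print(diff.shape)

# Summing the squares over the last axis gives all squared distances at once
sq_dist = np.sum(diff**2, axis=-1)        # shape (3, 2)
print(sq_dist.shape)

# The shortest distance is the square root of the smallest squared distance
min_dist = np.sqrt(sq_dist.min())
print(min_dist)
```

The intermediate array holds one entry per pair of points and per axis, which is why memory use grows with the product of the two structure sizes.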

Testing on your data (this needs IPython, only for its simplified interface to timeit):

struc_dict = { 
    'struc1' : [[-31.447,  -4.428, -28.285], [-32.558,  -2.108, -29.213], [-31.656,  -4.071, -30.89 ], [-33.899,  -4.504, -29.349]],
    'struc2' : [[-27.487, -15.05,  -31.418], [-29.178, -14.63,  -33.498], [-29.548, -16.754, -31.937], [-30.028, -14.278, -30.977]], 
    'struc3' : [[-16.07,   -2.042, -29.853], [-16.734,  -4.162, -29.905], [-16.279,  -4.438, -28.936], [-16.544,  -4.098, -31.514]]}
data = {k: np.array(v) for k,v in struc_dict.items()}
%timeit naive(data)
%timeit version1(data)

Output:

1000 loops, best of 3: 433 µs per loop
10000 loops, best of 3: 55.7 µs per loop

To better assess performance, let's try with more data:

np.random.seed(42)

for n in [10, 20, 50, 100]:
  for max_size in [10, 20, 50, 100]:
    data = {str(i): np.random.normal(size=[np.random.randint(1, max_size), 3])
            for i in range(n)}
    print("Measuring for n=%r and max_size=%r" % (n, max_size))
    %timeit naive(data)
    %timeit version1(data)

Output:

Measuring for n=10 and max_size=10
100 loops, best of 3: 10.6 ms per loop
1000 loops, best of 3: 784 µs per loop
Measuring for n=10 and max_size=20
10 loops, best of 3: 32.5 ms per loop
1000 loops, best of 3: 894 µs per loop
Measuring for n=10 and max_size=50
1 loop, best of 3: 323 ms per loop
1000 loops, best of 3: 1.94 ms per loop
Measuring for n=10 and max_size=100
1 loop, best of 3: 478 ms per loop
100 loops, best of 3: 2.36 ms per loop
Measuring for n=20 and max_size=10
10 loops, best of 3: 43.3 ms per loop
100 loops, best of 3: 3.32 ms per loop
Measuring for n=20 and max_size=20
10 loops, best of 3: 188 ms per loop
100 loops, best of 3: 3.78 ms per loop
Measuring for n=20 and max_size=50
1 loop, best of 3: 1.41 s per loop
100 loops, best of 3: 8.09 ms per loop
Measuring for n=20 and max_size=100
1 loop, best of 3: 4.34 s per loop
100 loops, best of 3: 17.8 ms per loop
Measuring for n=50 and max_size=10
1 loop, best of 3: 341 ms per loop
10 loops, best of 3: 22.6 ms per loop
Measuring for n=50 and max_size=20
1 loop, best of 3: 1.24 s per loop
10 loops, best of 3: 24.3 ms per loop
Measuring for n=50 and max_size=50
1 loop, best of 3: 7.86 s per loop
10 loops, best of 3: 45.9 ms per loop
Measuring for n=50 and max_size=100
1 loop, best of 3: 22.6 s per loop
10 loops, best of 3: 97.1 ms per loop
Measuring for n=100 and max_size=10
1 loop, best of 3: 1.03 s per loop
10 loops, best of 3: 83.5 ms per loop
Measuring for n=100 and max_size=20
1 loop, best of 3: 4 s per loop
10 loops, best of 3: 96.1 ms per loop
Measuring for n=100 and max_size=50
1 loop, best of 3: 27.4 s per loop
10 loops, best of 3: 180 ms per loop
Measuring for n=100 and max_size=100
1 loop, best of 3: 1min 50s per loop
1 loop, best of 3: 447 ms per loop

The speed-up varies a lot as a function of the number of coordinates (the more the better, as long as your machine has enough free RAM).

Further improvements can be achieved by:

  1. Using np.einsum (Einstein summation) instead of np.sum, which can be faster because it avoids building the intermediate array of squares.
  2. If all the structures have the same number of coordinates, you can remove the other two loops by building two four-dimensional arrays (e.g. Nx1xMx3 and 1xNxMx3, where N is the number of structures and M is the number of coordinates for each).
  3. Combining both of the previous points!
