Most efficient way to retrieve data from two lists nested as values of dictionary in python
I have a dictionary of coordinates from structures in three-dimensional space:
struc_dict = {
    'struc1' : [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
    'struc2' : [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
    'struc3' : [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...],
    'struc4' : [np.array([x, y, z]), np.array([x, y, z]), np.array([x, y, z]), ...] }
As an example:
struc_dict = {
    'struc1' : [[-31.447, -4.428, -28.285], [-32.558, -2.108, -29.213], [-31.656, -4.071, -30.89 ], [-33.899, -4.504, -29.349]],
    'struc2' : [[-27.487, -15.05, -31.418], [-29.178, -14.63, -33.498], [-29.548, -16.754, -31.937], [-30.028, -14.278, -30.977]],
    'struc3' : [[-16.07, -2.042, -29.853], [-16.734, -4.162, -29.905], [-16.279, -4.438, -28.936], [-16.544, -4.098, -31.514]]}
And I would like to find out the shortest distance between each pair of structures. So I would like to go through the dictionary, grab a pair of values, and calculate the shortest distance.
My current code looks like this, but it's not very pretty or efficient:
import numpy as np

for s1 in struc_dict.keys():
    for s2 in struc_dict.keys():
        # only consider distances between two different structures
        if s1 == s2:
            continue
        else:
            # defining an arbitrary max value, necessary for the first comparison?
            min_dist = 10000
            for c1 in struc_dict[s1]:
                for c2 in struc_dict[s2]:
                    # calculates the distance between the two coordinates
                    dist = np.linalg.norm(np.array(c1) - np.array(c2))
                    if dist <= min_dist:
                        min_dist = dist
            print("Min dist between {s1} & {s2} : {min:.3f} units".format(s1=s1, s2=s2, min=min_dist))
Output for the example:
Min dist between struc1 & struc2 : 10.309 units
Min dist between struc1 & struc3 : 14.804 units
Min dist between struc2 & struc1 : 10.309 units
Min dist between struc2 & struc3 : 15.377 units
Min dist between struc3 & struc1 : 14.804 units
Min dist between struc3 & struc2 : 15.377 units
This code works, but it calculates the distance between each pair of structures twice, since it goes through the dictionary twice. Also, I need a large min_dist start value for the first comparison of each pair of structures — is there a way around that?
In general, there must be a more elegant solution for this. Thanks!
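Both of the pain points above — the duplicate pair computations and the arbitrary min_dist start value — can be sidestepped with standard-library tools. A minimal sketch (not taken from the answers, just one possible variant) using the example data: itertools.combinations yields each unordered pair exactly once, and passing a generator to min() removes the need for a sentinel.

```python
import itertools
import numpy as np

struc_dict = {
    'struc1': [[-31.447, -4.428, -28.285], [-32.558, -2.108, -29.213],
               [-31.656, -4.071, -30.89], [-33.899, -4.504, -29.349]],
    'struc2': [[-27.487, -15.05, -31.418], [-29.178, -14.63, -33.498],
               [-29.548, -16.754, -31.937], [-30.028, -14.278, -30.977]],
    'struc3': [[-16.07, -2.042, -29.853], [-16.734, -4.162, -29.905],
               [-16.279, -4.438, -28.936], [-16.544, -4.098, -31.514]]}

# combinations() yields each unordered pair of keys exactly once, and
# min() over a generator needs no arbitrary start value.
for s1, s2 in itertools.combinations(struc_dict, 2):
    min_dist = min(np.linalg.norm(np.array(c1) - np.array(c2))
                   for c1 in struc_dict[s1] for c2 in struc_dict[s2])
    print("Min dist between {} & {} : {:.3f} units".format(s1, s2, min_dist))
```

This prints each pair only once (struc1 & struc2, struc1 & struc3, struc2 & struc3) with the same minimum distances as the original code.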
For a more elegant solution, consider itertools.product. Look at the following simple example:
import itertools

points = {'A': (1,1), 'B': (2,2), 'C': (3,3)}

def dist(a, b):
    return ((a[0]-b[0])**2 + (a[1]-b[1])**2)**0.5

for p1, p2 in itertools.product(points.keys(), repeat=2):
    print('Distance between', p1, 'and', p2, 'is', dist(points[p1], points[p2]))
Output:
Distance between A and A is 0.0
Distance between A and B is 1.4142135623730951
Distance between A and C is 2.8284271247461903
Distance between B and A is 1.4142135623730951
Distance between B and B is 0.0
Distance between B and C is 1.4142135623730951
Distance between C and A is 2.8284271247461903
Distance between C and B is 1.4142135623730951
Distance between C and C is 0.0
This allows you to avoid one nesting level, as opposed to a for inside a for.
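If the reversed duplicates (A, B) / (B, A) and the self-pairs that the question complains about should also be skipped, a small variation on the same idea (my addition, not part of the original answer) swaps product for itertools.combinations:

```python
import itertools

points = {'A': (1, 1), 'B': (2, 2), 'C': (3, 3)}

def dist(a, b):
    return ((a[0] - b[0])**2 + (a[1] - b[1])**2)**0.5

# combinations(..., 2) yields each unordered pair exactly once, so
# self-pairs like (A, A) and reversed duplicates like (B, A) never appear.
for p1, p2 in itertools.combinations(points, 2):
    print('Distance between', p1, 'and', p2, 'is', dist(points[p1], points[p2]))
```

This prints only three lines (A-B, A-C, B-C) instead of the nine produced by product.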
From your first post it was not clear whether the number of coordinates is the same across structures, so I assumed it was not.
Here is a slightly revised version of your naive approach and a first improved version exploiting the fast low-level vectorization of NumPy.
import numpy as np

def naive(data):
    res = np.inf
    for k1, v1 in data.items():
        for k2, v2 in data.items():
            if k1 == k2:
                continue
            for c1 in v1:
                for c2 in v2:
                    res = np.minimum(res, np.sum((c1 - c2)**2))
    return np.sqrt(res)

def version1(data):
    res = np.inf
    for k1, v1 in data.items():
        for k2, v2 in data.items():
            if k1 == k2:
                continue
            res = np.minimum(res, np.min(np.sum((v1[None, ...] - v2[:, None, :])**2, axis=-1)))
    return np.sqrt(res)
The crucial point is v1[None, ...] - v2[:, None, :], where, by adding an additional axis to each structure in a different position, we exploit NumPy broadcasting to remove the two inner loops.
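A quick shape trace may make the broadcasting step concrete (a minimal sketch with made-up array sizes):

```python
import numpy as np

v1 = np.zeros((4, 3))  # 4 coordinates of one structure
v2 = np.zeros((6, 3))  # 6 coordinates of another structure

# v1[None, ...]  has shape (1, 4, 3)
# v2[:, None, :] has shape (6, 1, 3)
# Broadcasting the subtraction produces every pairwise difference at once:
diff = v1[None, ...] - v2[:, None, :]
print(diff.shape)      # (6, 4, 3)

# Summing the squares over the last axis yields one squared distance
# per coordinate pair:
sq_dists = np.sum(diff**2, axis=-1)
print(sq_dists.shape)  # (6, 4)
```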
Testing on your data (needs IPython, just to use the simplified interface to timeit):
struc_dict = {
    'struc1' : [[-31.447, -4.428, -28.285], [-32.558, -2.108, -29.213], [-31.656, -4.071, -30.89 ], [-33.899, -4.504, -29.349]],
    'struc2' : [[-27.487, -15.05, -31.418], [-29.178, -14.63, -33.498], [-29.548, -16.754, -31.937], [-30.028, -14.278, -30.977]],
    'struc3' : [[-16.07, -2.042, -29.853], [-16.734, -4.162, -29.905], [-16.279, -4.438, -28.936], [-16.544, -4.098, -31.514]]}
data = {k: np.array(v) for k,v in struc_dict.items()}
%timeit naive(data)
%timeit version1(data)
Output:
1000 loops, best of 3: 433 µs per loop
10000 loops, best of 3: 55.7 µs per loop
To better assess performance, let's try with more data:
np.random.seed(42)

for n in [10, 20, 50, 100]:
    for max_size in [10, 20, 50, 100]:
        data = {str(i): np.random.normal(size=[np.random.randint(1, max_size), 3])
                for i in range(n)}
        print("Measuring for n=%r and max_size=%r" % (n, max_size))
        %timeit naive(data)
        %timeit version1(data)
Output:
Measuring for n=10 and max_size=10
100 loops, best of 3: 10.6 ms per loop
1000 loops, best of 3: 784 µs per loop
Measuring for n=10 and max_size=20
10 loops, best of 3: 32.5 ms per loop
1000 loops, best of 3: 894 µs per loop
Measuring for n=10 and max_size=50
1 loop, best of 3: 323 ms per loop
1000 loops, best of 3: 1.94 ms per loop
Measuring for n=10 and max_size=100
1 loop, best of 3: 478 ms per loop
100 loops, best of 3: 2.36 ms per loop
Measuring for n=20 and max_size=10
10 loops, best of 3: 43.3 ms per loop
100 loops, best of 3: 3.32 ms per loop
Measuring for n=20 and max_size=20
10 loops, best of 3: 188 ms per loop
100 loops, best of 3: 3.78 ms per loop
Measuring for n=20 and max_size=50
1 loop, best of 3: 1.41 s per loop
100 loops, best of 3: 8.09 ms per loop
Measuring for n=20 and max_size=100
1 loop, best of 3: 4.34 s per loop
100 loops, best of 3: 17.8 ms per loop
Measuring for n=50 and max_size=10
1 loop, best of 3: 341 ms per loop
10 loops, best of 3: 22.6 ms per loop
Measuring for n=50 and max_size=20
1 loop, best of 3: 1.24 s per loop
10 loops, best of 3: 24.3 ms per loop
Measuring for n=50 and max_size=50
1 loop, best of 3: 7.86 s per loop
10 loops, best of 3: 45.9 ms per loop
Measuring for n=50 and max_size=100
1 loop, best of 3: 22.6 s per loop
10 loops, best of 3: 97.1 ms per loop
Measuring for n=100 and max_size=10
1 loop, best of 3: 1.03 s per loop
10 loops, best of 3: 83.5 ms per loop
Measuring for n=100 and max_size=20
1 loop, best of 3: 4 s per loop
10 loops, best of 3: 96.1 ms per loop
Measuring for n=100 and max_size=50
1 loop, best of 3: 27.4 s per loop
10 loops, best of 3: 180 ms per loop
Measuring for n=100 and max_size=100
1 loop, best of 3: 1min 50s per loop
1 loop, best of 3: 447 ms per loop
The speed-up varies a lot as a function of the number of coordinates (the more, the better, as long as your machine has enough free RAM).
Further improvements can be achieved by using np.einsum (Einstein summation) instead of np.sum, which is well known to be faster.
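The answer does not show the np.einsum variant itself; a minimal sketch of what it presumably means (my reconstruction: squaring the difference vectors and summing over their last axis in a single einsum contraction):

```python
import numpy as np

rng = np.random.default_rng(0)
v1 = rng.normal(size=(4, 3))  # coordinates of one structure
v2 = rng.normal(size=(6, 3))  # coordinates of another structure

# All pairwise difference vectors via broadcasting, shape (6, 4, 3).
diff = v1[None, ...] - v2[:, None, :]

# Squared distances as computed in version1:
d_sum = np.sum(diff**2, axis=-1)

# The same contraction expressed with np.einsum: multiply diff by itself
# elementwise and sum over the shared last index k.
d_einsum = np.einsum('ijk,ijk->ij', diff, diff)

print(np.allclose(d_sum, d_einsum))  # True
```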