Find pair of rows with largest value difference in pandas dataframe
I have a large dataframe containing the x, y, z coordinates of a surface. I am looking to find the pair of rows with the largest slope between them (dz/sqrt(dx^2+dy^2)).
import math

maxGrad = 0
currentGrad = 0
height = 0
# Compare every row i with every later row j
for i in range(len(df)):
    for j in range(i+1,len(df)):
        height = abs(df.z.iloc[j]-df.z.iloc[i])
        distance = math.sqrt((df.x.iloc[j]-df.x.iloc[i])**2+(df.y.iloc[j]-df.y.iloc[i])**2)
        currentGrad = height/distance
        if currentGrad > maxGrad:
            maxGrad = currentGrad
            maxCoorPair = [df.x.iloc[i],df.y.iloc[i],df.x.iloc[j],df.y.iloc[j]]
print(maxGrad, maxCoorPair)
However, this is not very elegant, and the run time is very long because of the nested for loop.
How can I do it better?
Example data:
df = pd.DataFrame(
    {
        "x":[1,5,3],
        "y":[12,10,13],
        "z":[4,4,1],
    }
)
Let's define a function that takes a vector of length n and returns an n×n numpy array in which the element in row i and column j is the difference between the i-th and j-th elements of the vector.
def d(vector):
    return vector.apply(lambda x: x-vector).values
For example,
d(df["x"])
gives
array([[ 0, -4, -2],
[ 4, 0, 2],
[ 2, -2, 0]], dtype=int64)
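The same n×n difference matrix can also be built with NumPy broadcasting instead of `Series.apply`. The following is a minimal sketch, not part of the original answer; the name `pairwise_diff` is a hypothetical helper introduced for illustration.

```python
import numpy as np
import pandas as pd

def pairwise_diff(values):
    """Return an n x n array whose [i, j] entry is values[i] - values[j]."""
    a = np.asarray(values)
    # a[:, None] has shape (n, 1) and a[None, :] has shape (1, n);
    # broadcasting the subtraction expands both to (n, n).
    return a[:, None] - a[None, :]

df = pd.DataFrame({"x": [1, 5, 3], "y": [12, 10, 13], "z": [4, 4, 1]})
dx = pairwise_diff(df["x"])
print(dx)
```

For the example data this yields the same matrix as `d(df["x"])`, without calling a Python lambda once per element.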
Plug this function into your formula like so, to perform the calculation for all pairs of coordinates at once:
slopes = d(df["z"])/np.sqrt(np.square(d(df["x"])) + np.square(d(df["y"])))
slopes looks like this:
array([[ nan, 0. , 1.34164079],
[ 0. , nan, 0.83205029],
[-1.34164079, -0.83205029, nan]])
Note that the formula does not use the absolute value of dz. This is deliberate, so that we do not end up with answers of the form "(a,b) and (b,a)".
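To see why the sign matters, the sketch below compares the signed and the absolute-value versions on the example data. It is an illustration only: it uses broadcasting rather than the `d` helper, and `np.errstate` is used just to silence the 0/0 warning on the diagonal.

```python
import numpy as np

x = np.array([1, 5, 3], dtype=float)
y = np.array([12, 10, 13], dtype=float)
z = np.array([4, 4, 1], dtype=float)

# Pairwise differences: entry [i, j] is value[i] - value[j].
dx = x[:, None] - x[None, :]
dy = y[:, None] - y[None, :]
dz = z[:, None] - z[None, :]

# The diagonal is 0/0, which evaluates to nan; suppress the warning locally.
with np.errstate(invalid="ignore", divide="ignore"):
    signed = dz / np.sqrt(dx**2 + dy**2)
    unsigned = np.abs(dz) / np.sqrt(dx**2 + dy**2)

# Index pairs at which each matrix attains its maximum.
signed_hits = list(zip(*np.where(signed == np.nanmax(signed))))
unsigned_hits = list(zip(*np.where(unsigned == np.nanmax(unsigned))))
print(len(signed_hits), len(unsigned_hits))
```

With the signed slopes the maximum occurs once; with absolute values the matrix is symmetric, so the same pair is reported twice, once in each order.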
We can then use numpy.nanmax, which calculates the maximum of an array while ignoring nan values, and numpy.where to pull out the rows and columns that match the maximum found:
rows, cols = np.where(slopes==np.nanmax(slopes))
We can then zip these up to get tuples of row indices:
list(zip(rows, cols))
which gives us [(0, 2)]
So the largest slope is between the coordinates in rows 0 and 2.
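Putting the steps together, a single function might look like the sketch below. It is one possible arrangement rather than the answer's exact method: it assumes columns named `x`, `y`, `z`, uses the absolute value of dz, and masks the lower triangle so that each pair is considered only once. The name `steepest_pair` is hypothetical.

```python
import numpy as np
import pandas as pd

def steepest_pair(df):
    """Return (slope, i, j) for the row pair with the largest |dz| / horizontal distance."""
    xyz = df[["x", "y", "z"]].to_numpy(dtype=float)
    # diff[i, j, :] = xyz[i] - xyz[j], shape (n, n, 3)
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    with np.errstate(invalid="ignore", divide="ignore"):
        slopes = np.abs(diff[..., 2]) / dist
    # Keep only i < j: the diagonal and the lower triangle become nan.
    slopes[np.tril_indices(len(df))] = np.nan
    # nanargmax returns a flat index; unravel_index converts it to (row, col).
    i, j = np.unravel_index(np.nanargmax(slopes), slopes.shape)
    return float(slopes[i, j]), int(i), int(j)

df = pd.DataFrame({"x": [1, 5, 3], "y": [12, 10, 13], "z": [4, 4, 1]})
slope, i, j = steepest_pair(df)
print(slope, i, j)
```

Note that `np.nanargmax` raises a `ValueError` when every entry is nan, which happens for a dataframe with fewer than two rows, and that the intermediate arrays need memory proportional to n².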
As is often the case, it depends on the size of the problem.
Let's do some investigations.
Conclusions:
NumPy arrays are faster than DataFrames, even if no NumPy function is used.
Using NumPy functions speeds things up further, but at the expense of:
from collections import namedtuple
import pandas as pd
import numpy as np
import math
Run = namedtuple('run', ['order', 'npa', 'df'])
def generate(maxorder):
    # Build one random data set of 10**n points for each order n
    run_list = []
    for n in range(1, maxorder + 1):
        order = n
        npa = np.random.random(size=(10**n, 3))
        run_list.append(Run(order, npa, pd.DataFrame(npa, columns=list('xyz'))))
    return run_list
runs = generate(5)
from functools import wraps
import time
def timeit(func):
    # Decorator that prints the wall-clock time of each call in milliseconds
    @wraps(func)
    def timed(*args, **kw):
        start_time = time.time()
        result = func(*args, **kw)
        end_time = time.time()
        print(f"{func.__name__} {(end_time - start_time) * 1000} ms")
        return result
    return timed
def prints(f, data, max_order):
    # Run f on the first max_order data sets, passing the DataFrame or the array
    for order,npa,df in runs[:max_order]:
        print()
        print("Order:", order)
        maxGrad, maxCoorPair = f(df if data == 'df' else npa)
        print("maxGrad:", maxGrad)
        print("Pair:")
        print(maxCoorPair)
@timeit
def sol_original(df):
    maxGrad = 0
    currentGrad = 0
    height = 0
    for i in range(len(df)):
        for j in range(i+1,len(df)):
            height = abs(df.z.iloc[j]-df.z.iloc[i])
            distance = math.sqrt((df.x.iloc[j]-df.x.iloc[i])**2+(df.y.iloc[j]-df.y.iloc[i])**2)
            currentGrad = height/distance
            if currentGrad > maxGrad:
                maxGrad = currentGrad
                maxCoorPair = [(df.x.iloc[i],df.y.iloc[i]),(df.x.iloc[j],df.y.iloc[j])]
    return maxGrad, maxCoorPair
prints(sol_original, 'df', 3)
Order: 1
sol_original 2.979278564453125 ms
maxGrad: 4.274082602280762
Pair:
[(0.08217955028694601, 0.9160537098844143), (0.2396284679279188, 0.8196073645585937)]
Order: 2
sol_original 264.29247856140137 ms
maxGrad: 167.5282986999116
Pair:
[(0.09926115331767238, 0.7497707080285022), (0.09615387652529517, 0.7469231082687531)]
Order: 3
sol_original 24213.2625579834 ms
maxGrad: 1246.4073209631038
Pair:
[(0.8285207768494603, 0.016839864860434428), (0.8285102753052541, 0.016228726039262287)]
@timeit
def sol_numpy10(npa):
    # Same nested loop as sol_original, but indexing a NumPy array
    maxGrad = 0
    currentGrad = 0
    height = 0
    for i in range(len(npa)):
        for j in range(i+1,len(npa)):
            height = abs(npa[j][2]-npa[i][2])
            distance = math.sqrt((npa[j][0]-npa[i][0])**2+(npa[j][1]-npa[i][1])**2)
            currentGrad = height/distance
            if currentGrad > maxGrad:
                maxGrad = currentGrad
                maxCoorPair = [(npa[i][0],npa[i][1]),(npa[j][0],npa[j][1])]
    return maxGrad, maxCoorPair
prints(sol_numpy10, 'np', 3)
Order: 1
sol_numpy10 0.0 ms
maxGrad: 4.274082602280762
Pair:
[(0.08217955028694601, 0.9160537098844143), (0.2396284679279188, 0.8196073645585937)]
Order: 2
sol_numpy10 13.963699340820312 ms
maxGrad: 167.5282986999116
Pair:
[(0.09926115331767238, 0.7497707080285022), (0.09615387652529517, 0.7469231082687531)]
Order: 3
sol_numpy10 998.3282089233398 ms
maxGrad: 1246.4073209631038
Pair:
[(0.8285207768494603, 0.016839864860434428), (0.8285102753052541, 0.016228726039262287)]
@timeit
def sol_numpy11(npa):
    # Unpack row i once per outer iteration
    maxGrad = 0
    currentGrad = 0
    height = 0
    for i in range(len(npa)):
        xi,yi,zi = npa[i]
        for j in range(i+1,len(npa)):
            height = abs(npa[j][2]-zi)
            distance = math.sqrt((npa[j][0]-xi)**2+(npa[j][1]-yi)**2)
            currentGrad = height/distance
            if currentGrad > maxGrad:
                maxGrad = currentGrad
                maxCoorPair = [(xi,yi),(npa[j][0],npa[j][1])]
    return(maxGrad, maxCoorPair)
prints(sol_numpy11, 'np', 3)
Order: 1
sol_numpy11 0.0 ms
maxGrad: 4.274082602280762
Pair:
[(0.08217955028694601, 0.9160537098844143), (0.2396284679279188, 0.8196073645585937)]
Order: 2
sol_numpy11 10.002613067626953 ms
maxGrad: 167.5282986999116
Pair:
[(0.09926115331767238, 0.7497707080285022), (0.09615387652529517, 0.7469231082687531)]
Order: 3
sol_numpy11 771.9049453735352 ms
maxGrad: 1246.4073209631038
Pair:
[(0.8285207768494603, 0.016839864860434428), (0.8285102753052541, 0.016228726039262287)]
@timeit
def sol_numpy12(npa):
    # Unpack both row i and row j before computing
    maxGrad = 0
    currentGrad = 0
    height = 0
    for i in range(len(npa)):
        xi,yi,zi = npa[i]
        for j in range(i+1,len(npa)):
            xj,yj,zj = npa[j]
            height = abs(zj-zi)
            distance = math.sqrt((xj-xi)**2+(yj-yi)**2)
            currentGrad = height/distance
            if currentGrad > maxGrad:
                maxGrad = currentGrad
                maxCoorPair = [(xi,yi),(xj,yj)]
    return(maxGrad, maxCoorPair)
prints(sol_numpy12, 'np', 3)
Order: 1
sol_numpy12 0.0 ms
maxGrad: 4.274082602280762
Pair:
[(0.08217955028694601, 0.9160537098844143), (0.2396284679279188, 0.8196073645585937)]
Order: 2
sol_numpy12 8.97526741027832 ms
maxGrad: 167.5282986999116
Pair:
[(0.09926115331767238, 0.7497707080285022), (0.09615387652529517, 0.7469231082687531)]
Order: 3
sol_numpy12 1075.1206874847412 ms
maxGrad: 1246.4073209631038
Pair:
[(0.8285207768494603, 0.016839864860434428), (0.8285102753052541, 0.016228726039262287)]
Riley's answer
def d(vector):
    return vector.apply(lambda x: x-vector).values
@timeit
def sol_numpy20(df):
    slopes = np.abs(d(df["z"]))/np.sqrt(np.square(d(df["x"])) + np.square(d(df["y"])))
    maxGrad = np.nanmax(slopes)
    ind1, ind2 = np.where(slopes==maxGrad)
    maxCoorPair = [df.iloc[ind2]]
    return maxGrad, maxCoorPair
prints(sol_numpy20, 'df', 4)
Order: 1
10
sol_numpy20 6.981372833251953 ms
maxGrad: 4.274082602280762
Pair:
[ x y z
6 0.239628 0.819607 0.867972
5 0.082180 0.916054 0.078804]
Order: 2
100
sol_numpy20 42.88601875305176 ms
maxGrad: 167.5282986999116
Pair:
[ x y z
95 0.096154 0.746923 0.135159
45 0.099261 0.749771 0.841247]
Order: 3
1000
<ipython-input-201-ce6f8e091e1b>:7: RuntimeWarning: invalid value encountered in true_divide
slopes = np.abs(d(df["z"]))/np.sqrt(np.square(d(df["x"])) + np.square(d(df["y"])))
<ipython-input-201-ce6f8e091e1b>:7: RuntimeWarning: invalid value encountered in true_divide
slopes = np.abs(d(df["z"]))/np.sqrt(np.square(d(df["x"])) + np.square(d(df["y"])))
<ipython-input-201-ce6f8e091e1b>:7: RuntimeWarning: invalid value encountered in true_divide
slopes = np.abs(d(df["z"]))/np.sqrt(np.square(d(df["x"])) + np.square(d(df["y"])))
sol_numpy20 625.3554821014404 ms
maxGrad: 1246.4073209631038
Pair:
[ x y z
991 0.828510 0.016229 0.893811
240 0.828521 0.016840 0.131971]
Order: 4
10000
<ipython-input-201-ce6f8e091e1b>:7: RuntimeWarning: invalid value encountered in true_divide
slopes = np.abs(d(df["z"]))/np.sqrt(np.square(d(df["x"])) + np.square(d(df["y"])))
sol_numpy20 12746.904850006104 ms
maxGrad: 4346.72025021119
Pair:
[ x y z
3751 0.538723 0.168160 0.131924
1063 0.538800 0.168028 0.798256]
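A possible middle ground between the nested loops and the full n×n matrices is to keep the outer loop in Python and vectorize only the inner one, so that memory use stays proportional to n. The sketch below illustrates that idea; it was not part of the benchmarks above and has not been timed here, and the name `sol_rowwise` is hypothetical.

```python
import numpy as np

def sol_rowwise(npa):
    """Largest |dz| / horizontal distance over all row pairs of an (n, 3) array."""
    max_grad = -np.inf
    best = None
    for i in range(len(npa) - 1):
        # Differences between row i and every later row, computed in one step.
        d = npa[i + 1:] - npa[i]
        dist = np.hypot(d[:, 0], d[:, 1])
        with np.errstate(invalid="ignore", divide="ignore"):
            grad = np.abs(d[:, 2]) / dist
        if np.all(np.isnan(grad)):
            continue  # every later row coincides with row i (0/0 everywhere)
        k = int(np.nanargmax(grad))
        if grad[k] > max_grad:
            max_grad = float(grad[k])
            best = (i, i + 1 + k)  # row indices of the steepest pair so far
    return max_grad, best

example = np.array([[1, 12, 4], [5, 10, 4], [3, 13, 1]], dtype=float)
result_grad, result_pair = sol_rowwise(example)
print(result_grad, result_pair)
```

Unlike the loop versions above, this returns the row indices of the pair rather than their coordinates; the coordinates can be read back with `npa[list(result_pair), :2]`.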