Can I use lambda, map, apply, or applymap to fill a dataframe?
This is a simplified version of my data. I have a dataframe of coordinates, and an empty dataframe which should be filled with the distance of each pair using the function provided.
What is the quickest method to fill this dataframe? As much as possible, I want to stay away from nested for loops (slow!). Can I use apply or applymap? You may modify the function or other parts accordingly. Thanks.
import pandas as pd
def get_distance(point1, point2):
    """Gets the coordinates of two points as two lists, and outputs their distance"""
    return (((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2 + (point1[2] - point2[2]) ** 2) ** 0.5)
#Dataframe of coordinates.
df = pd.DataFrame({"No.": [25, 36, 70, 95, 112, 101, 121, 201], "x": [1,2,3,4,2,3,4,5], "y": [2,3,4,5,3,4,5,6], "z": [3,4,5,6,4,5,6,7]})
df.set_index("No.", inplace = True)
#Dataframe to be filled with each pair distance.
df_dist = pd.DataFrame({'target': [112, 101, 121, 201]}, columns=["target", 25, 36, 70, 95])
df_dist.set_index("target", inplace = True)
If you don't want to use for loops, you can compute the distances between all the possible pairs in the following way. You first need to take the Cartesian product of df with itself to get all the possible pairs of points.
import numpy as np
i, j = np.where(1 - np.eye(len(df)))
# reset_index() keeps "No." as a column (it is the index in the question's df)
df = df.iloc[i].reset_index().join(
    df.iloc[j].reset_index(), rsuffix='_2')
Here i and j are the row and column positions of the off-diagonal entries (the upper and lower triangles) of a square matrix of size len(df), so every ordered pair of distinct points appears once. After you have done this, you just need to apply your distance function:
df['distance'] = get_distance([df['x'],df['y'],df['z']], [df['x_2'],df['y_2'],df['z_2']])
df.head()
   No.  x  y  z  No._2  x_2  y_2  z_2  distance
0   25  1  2  3     36    2    3    4  1.732051
1   25  1  2  3     70    3    4    5  3.464102
2   25  1  2  3     95    4    5    6  5.196152
3   25  1  2  3    112    2    3    4  1.732051
4   25  1  2  3    101    3    4    5  3.464102
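To see what the index arrays from np.where(1 - np.eye(n)) contain, here is a minimal sketch for a small case. The size n = 3 and the name pairs are chosen only for illustration and are not part of the answer.

```python
import numpy as np

n = 3  # illustrative size, not taken from the question's data

# 1 - np.eye(n) is 0 on the diagonal and 1 everywhere else; np.where
# returns the row and column positions of the nonzero entries.
i, j = np.where(1 - np.eye(n))

# Each (i, j) is an ordered pair of distinct row positions.
pairs = list(zip(i.tolist(), j.tolist()))
print(pairs)  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```

Every pair shows up in both orders, which is why the joined table has n * (n - 1) rows.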
If you want to compute only the pairs needed for df_dist, you can modify the matrix 1 - np.eye(len(df)) accordingly.
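One possible way to modify that matrix, sketched here under assumptions: replace 1 - np.eye(len(df)) with a boolean mask that is True only where the row is a target point and the column is a source point. The names targets, sources, mask and pairs are illustrative and do not come from the answer. The final pivot step is just one option for getting back to the df_dist layout.

```python
import numpy as np
import pandas as pd

def get_distance(point1, point2):
    """Euclidean distance between two 3-D points (scalars or Series)."""
    return ((point1[0] - point2[0]) ** 2
            + (point1[1] - point2[1]) ** 2
            + (point1[2] - point2[2]) ** 2) ** 0.5

df = pd.DataFrame({"No.": [25, 36, 70, 95, 112, 101, 121, 201],
                   "x": [1, 2, 3, 4, 2, 3, 4, 5],
                   "y": [2, 3, 4, 5, 3, 4, 5, 6],
                   "z": [3, 4, 5, 6, 4, 5, 6, 7]}).set_index("No.")

targets = [112, 101, 121, 201]  # row labels wanted in df_dist (assumed)
sources = [25, 36, 70, 95]      # column labels wanted in df_dist (assumed)

# mask[r, c] is True only when row r is a target and row c is a source.
is_target = df.index.isin(targets)
is_source = df.index.isin(sources)
mask = is_target[:, None] & is_source[None, :]
i, j = np.where(mask)

# Same join as in the answer, but only over the selected pairs.
pairs = df.iloc[i].reset_index().join(df.iloc[j].reset_index(), rsuffix="_2")
pairs["distance"] = get_distance(
    [pairs["x"], pairs["y"], pairs["z"]],
    [pairs["x_2"], pairs["y_2"], pairs["z_2"]],
)

# Reshape the long table into the target-by-source layout of df_dist.
df_dist = (pairs.pivot(index="No.", columns="No._2", values="distance")
                .loc[targets, sources])
df_dist.index.name = "target"
df_dist.columns.name = None
print(df_dist)
```

This computes 16 pairs instead of the 56 produced by the full off-diagonal mask.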
AFAIK there is no clear speed benefit of a lambda over a for loop, and it's very hard to write a double lambda; lambdas are usually reserved for straightforward row operations.
However, with some engineering, we can reduce our code to a few simple and self-explanatory lines:
import numpy as np
get = lambda i: df.loc[i,:].values
dist = lambda i, j: np.sqrt(sum((get(i) - get(j))**2))
# Fills your df_dist
for i in df_dist.columns:
    for j in df_dist.index:
        df_dist.loc[j,i] = dist(i, j)
The resulting df_dist:
              25        36        70        95
target
112     1.732051  0.000000  1.732051  3.464102
101     3.464102  1.732051  0.000000  1.732051
121     5.196152  3.464102  1.732051  0.000000
201     6.928203  5.196152  3.464102  1.732051
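If the explicit double loop becomes a bottleneck, the same table can probably be filled in one step with NumPy broadcasting. This is a sketch of a swapped-in technique, not the method used in the answer above; the names targets, sources, a, b and diff are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"No.": [25, 36, 70, 95, 112, 101, 121, 201],
                   "x": [1, 2, 3, 4, 2, 3, 4, 5],
                   "y": [2, 3, 4, 5, 3, 4, 5, 6],
                   "z": [3, 4, 5, 6, 4, 5, 6, 7]}).set_index("No.")

targets = [112, 101, 121, 201]  # rows of df_dist
sources = [25, 36, 70, 95]      # columns of df_dist

# Coordinates as plain arrays: one row per point, columns x, y, z.
a = df.loc[targets, ["x", "y", "z"]].to_numpy(dtype=float)  # shape (4, 3)
b = df.loc[sources, ["x", "y", "z"]].to_numpy(dtype=float)  # shape (4, 3)

# Broadcasting yields every target-minus-source difference at once,
# with shape (len(targets), len(sources), 3).
diff = a[:, None, :] - b[None, :, :]

df_dist = pd.DataFrame(np.sqrt((diff ** 2).sum(axis=-1)),
                       index=pd.Index(targets, name="target"),
                       columns=sources)
print(df_dist)
```

The intermediate diff array holds len(targets) * len(sources) * 3 floats, so for very large inputs the memory cost is worth checking before using this approach.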