Running an external function within Pandas Dataframe to speed up processing loops
Good Day Peeps,
I currently have 2 data frames, "Locations" and "Pokestops", both containing a list of coordinates. The goal with these 2 data frames is to cluster points from "Pokestops" that are within 70m of the points in "Locations".
I have created a "brute force" clustering script. The process is as follows:
for i in range(len(locations)-1, -1, -1):
    array = []
    for f in range(0, len(pokestops)):
        if geopy.distance.geodesic(locations.iloc[i, 2], pokestops.iloc[f, 2]).m <= 70:
            array.append(f)
    if len(array) <= 0:
        locations.drop([i], inplace=True)
    else:
        locations.iat[i, 3] = array
locations["Length"] = locations["Pokestops"].map(len)
This results in:
Lat Long Coordinates Pokestops Length
2 -33.916432 18.426188 -33.916432,18.4261883 [1] 1
3 -33.916432 18.426287 -33.916432,18.42628745 [1] 1
4 -33.916432 18.426387 -33.916432,18.4263866 [1] 1
5 -33.916432 18.426486 -33.916432,18.42648575 [0, 1] 2
6 -33.916432 18.426585 -33.916432,18.4265849 [0, 1] 2
7 -33.916432 18.426684 -33.916432,18.426684050000002 [0, 1] 2
locations.sort_values("Length", ascending=False, inplace=True)
This results in:
Lat Long Coordinates Pokestops Length
136 -33.915441 18.426585 -33.91544050000003,18.4265849 [1, 2, 3, 4] 4
149 -33.915341 18.426585 -33.915341350000034,18.4265849 [1, 2, 3, 4] 4
110 -33.915639 18.426585 -33.915638800000025,18.4265849 [1, 2, 3, 4] 4
111 -33.915639 18.426684 -33.915638800000025,18.426684050000002 [1, 2, 3, 4] 4
stops = list(locations['Pokestops'])
seen = list(locations.iloc[0, 3])
stops_filtered = [seen]
for xx in stops[1:]:
    xx = [x for x in xx if x not in seen]
    stops_filtered.append(xx)
locations['Pokestops'] = stops_filtered
This results in:
Lat Long Coordinates Pokestops Length
136 -33.915441 18.426585 -33.91544050000003,18.4265849 [1, 2, 3, 4] 4
149 -33.915341 18.426585 -33.915341350000034,18.4265849 [] 4
110 -33.915639 18.426585 -33.915638800000025,18.4265849 [] 4
111 -33.915639 18.426684 -33.915638800000025,18.426684050000002 [] 4
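The filtering step above can be hard to follow in isolation; a minimal sketch on invented lists (not the real dataframe) behaves the same way: only the first row's pokestops count as "seen", so later rows just lose any index that the top cluster already claimed.

```python
# Toy stand-in for list(locations['Pokestops']) after sorting by Length.
stops = [[1, 2, 3, 4], [1, 2, 3, 4], [5], [0, 1]]

seen = list(stops[0])          # pokestops claimed by the largest cluster
stops_filtered = [seen]
for xx in stops[1:]:
    # keep only indices the top cluster has not already claimed
    stops_filtered.append([x for x in xx if x not in seen])

# stops_filtered -> [[1, 2, 3, 4], [], [5], [0]]
```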
locations = locations[locations['Pokestops'].map(len)>0]
This results in:
Lat Long Coordinates Pokestops Length
136 -33.915441 18.426585 -33.91544050000003,18.4265849 [1, 2, 3, 4] 4
176 -33.915143 18.426684 -33.91514305000004,18.426684050000002 [5] 3
180 -33.915143 18.427081 -33.91514305000004,18.427080650000004 [5] 3
179 -33.915143 18.426982 -33.91514305000004,18.426981500000004 [5] 3
clusters = np.append(clusters, locations.iloc[0 , 0:2])
This results in:
Lat Long Coordinates Pokestops Length
176 -33.915143 18.426684 -33.91514305000004,18.426684050000002 [5] 3
180 -33.915143 18.427081 -33.91514305000004,18.427080650000004 [5] 3
179 -33.915143 18.426982 -33.91514305000004,18.426981500000004 [5] 3
64 -33.916035 18.427180 -33.91603540000001,18.427179800000005 [0] 3
This all results in an array containing the coordinates of every point from the Locations dataframe that has Pokestops within 70m, sorted from largest to smallest cluster.
Now for the actual question.
The method I am using in steps 1-3 requires looping a few million times even for a small-to-medium dataset.
I believe I can achieve faster times by moving away from "for" loops and letting Pandas calculate the distances between the two tables "directly" using the geopy.distance.geodesic function.
I am just unsure how to even approach this...
I know there is a library called GeoPandas, but it requires conda, and it would mean giving up the arrays/lists I keep in the Locations["Pokestops"] column. (To be fair, I also have zero knowledge of how to use GeoPandas.)
I know very broad questions like this are generally shunned, but I am fully self-taught in Python, trying to achieve a script that is most likely too complicated for my level.
I've made it this far; I just need this last step to make it more efficient. The script is fully working and produces the required results, it simply takes too long to run because of the nested for loops.
Any advice/ideas are greatly appreciated, and keep in mind that my knowledge of Python/Pandas is somewhat limited and I do not know all the functions/terminology.
Thank you @Finn. Although this solution required me to significantly alter my main body of code, it is working as intended.
With the new matrix, I am filtering everything > 0.07 to be NaN.
Lat Long Count 0 1 2 3 4
82 -33.904620 18.402612 5 NaN NaN NaN 0.052401 NaN
75 -33.904620 18.400183 5 NaN NaN NaN NaN 0.053687
120 -33.903579 18.401224 5 NaN NaN NaN NaN NaN
68 -33.904967 18.402612 5 NaN 0.044402 NaN 0.015147 NaN
147 -33.902885 18.400877 5 NaN NaN NaN NaN NaN
89 -33.904273 18.400183 5 NaN NaN NaN NaN NaN
182 -33.901844 18.398448 4 NaN NaN NaN NaN NaN
54 -33.905314 18.402612 4 NaN 0.020793 NaN 0.026215 NaN
183 -33.901844 18.398795 4 NaN NaN NaN NaN NaN
184 -33.901844 18.399142 4 NaN NaN NaN NaN NaN
The problem I face now is step 5 in my original post. Can you advise how I would go about removing all columns that do NOT contain NaN in the first row? The only information I can find is about removing columns if ANY value in any row is not NaN. I have tried every combination of .dropna() I could find online.
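Not part of the original thread, but one possible way to express "drop every distance column whose first-row value is not NaN" while keeping the metadata columns. The column names and values below are invented to mirror the table above:

```python
import numpy as np
import pandas as pd

# Toy version of the distance matrix: metadata columns plus one
# distance column per pokestop index (values invented for illustration).
df = pd.DataFrame({
    "Lat":   [-33.904620, -33.904967],
    "Long":  [18.402612, 18.402612],
    "Count": [5, 5],
    0: [np.nan, np.nan],
    1: [np.nan, 0.044402],
    3: [0.052401, 0.015147],
})

meta = ["Lat", "Long", "Count"]
# Keep a column if it is metadata, or if its value in the FIRST row is NaN.
keep = df.columns.isin(meta) | df.iloc[0].isna().to_numpy()
df = df.loc[:, keep]
# Remaining columns: Lat, Long, Count, 0, 1 (column 3 is dropped,
# because its first-row value 0.052401 is not NaN)
```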
The apply function might be helpful. apply applies the specified function along an axis of the DataFrame (you can control this and other behaviour through its parameters). Check this documentation ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html ) for further understanding.
I do believe loops will be very chaotic to implement once the solution hides in multiple layers. From the perspective of working with datasets, a loop-free approach using apply functions is far better when quick solutions are expected.
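As a minimal illustration of DataFrame.apply on toy data (not the poster's frames):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Column-wise (default axis=0): the function receives each column as a Series.
doubled = df.apply(lambda col: col * 2)

# Row-wise (axis=1): the function receives each row as a Series.
row_sums = df.apply(lambda row: row.sum(), axis=1)
# row_sums.tolist() -> [4.0, 6.0]
```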
I don't fully understand your code, but from your text there are a few things you can do to speed this up. I think the most beneficial change is to vectorize your distance calculations, because looping over every combination takes forever. So I stole the calculation from this answer and adapted it to create a matrix:
import numpy as np
import pandas as pd
pokestops = pd.read_csv('Pokestops.txt', header=None)
pokestops.columns = ["Lat", "Long"]
locations = pd.read_csv('Locations.txt')
locations.columns = ["Lat", "Long"]
def haversine_np_matrix(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    lon1 = np.expand_dims(lon1, axis=0)
    lat1 = np.expand_dims(lat1, axis=0)
    lon2 = np.expand_dims(lon2, axis=1)
    lat2 = np.expand_dims(lat2, axis=1)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
distances = haversine_np_matrix(pokestops["Long"],pokestops["Lat"], locations["Long"],locations['Lat'])
This gives you the distance from each Location to each Pokestop. Now you can use something like
distances < 0.07
to find all pairs that are closer than 70 meters. For this to work I stripped Location.txt of everything but Long and Lat. I am not sure what
meters = 10
and
degrees = 0.000009915
in your text do, so you may have to adapt the calculation, and you may want to compare the 6367 km radius against the calculation from geopy.distance.geodesic as described here to get the same results.
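One way to connect the matrix back to per-location Pokestop lists like the original column, sketched on an invented matrix (in the answer above, rows correspond to locations and columns to pokestops):

```python
import numpy as np

# Invented distance matrix in km: 2 locations x 3 pokestops.
distances = np.array([
    [0.05, 0.20, 0.06],
    [0.30, 0.40, 0.50],
])

mask = distances < 0.07  # True where a pokestop is within 70 m
# One list of nearby pokestop indices per location row
stop_lists = [list(np.flatnonzero(row)) for row in mask]
# stop_lists -> [[0, 2], []]
```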