Running an external function within a Pandas DataFrame to speed up processing loops

Good Day Peeps,

I currently have 2 data frames, "Locations" and "Pokestops", both containing a list of coordinates. The goal with these two data frames is to cluster points from "Pokestops" that are within 70m of the points in "Locations".

I have created a "Brute Force" clustering script.

The process is as follows:

  1. Calculate which "Pokestops" are within 70m of each point in "Locations".
  2. Add all nearby Pokestops to Locations["Pokestops"], as a list/array of their index values, e.g. ([0, 4, 22]).
  3. If no Pokestops are near a point in "Locations", remove that row from the Locations df.
# Walk the Locations df backwards so rows can be dropped while iterating
for i in range(len(locations)-1, -1, -1):
    array = []
    # Collect the index of every Pokestop within 70m of this location
    for f in range(0, len(pokestops)):
        if geopy.distance.geodesic(locations.iloc[i, 2], pokestops.iloc[f, 2]).m <= 70:
            array.append(f)
    if len(array) <= 0:
        # No nearby Pokestops: drop this location entirely
        locations.drop([i], inplace=True)
    else:
        locations.iat[i, 3] = array

# Compute the cluster size once, after the loop, instead of on every hit
locations["Length"] = locations["Pokestops"].map(len)

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
2   -33.916432  18.426188                   -33.916432,18.4261883           [1]       1
3   -33.916432  18.426287                  -33.916432,18.42628745           [1]       1
4   -33.916432  18.426387                   -33.916432,18.4263866           [1]       1
5   -33.916432  18.426486                  -33.916432,18.42648575        [0, 1]       2
6   -33.916432  18.426585                   -33.916432,18.4265849        [0, 1]       2
7   -33.916432  18.426684           -33.916432,18.426684050000002        [0, 1]       2
  4. Sort by most to least amount of Pokestops within 70m.
locations.sort_values("Length", ascending=False, inplace=True)

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
149 -33.915341  18.426585          -33.915341350000034,18.4265849  [1, 2, 3, 4]       4
110 -33.915639  18.426585          -33.915638800000025,18.4265849  [1, 2, 3, 4]       4
111 -33.915639  18.426684  -33.915638800000025,18.426684050000002  [1, 2, 3, 4]       4
  5. Remove all index values listed in Locations[0, "Pokestops"] from all other rows Locations[1:, "Pokestops"].
# Pokestops claimed by the top-ranked location
seen = list(locations.iloc[0, 3])
stops = list(locations['Pokestops'])
# Keep the top row's list intact, strip the claimed stops from every other row
stops_filtered = [seen]
for xx in stops[1:]:
    stops_filtered.append([x for x in xx if x not in seen])
locations['Pokestops'] = stops_filtered

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
149 -33.915341  18.426585          -33.915341350000034,18.4265849            []       4
110 -33.915639  18.426585          -33.915638800000025,18.4265849            []       4
111 -33.915639  18.426684  -33.915638800000025,18.426684050000002            []       4
  6. Remove all rows with an empty Locations["Pokestops"] list.
locations = locations[locations['Pokestops'].map(len)>0]

This results in:

           Lat       Long                             Coordinates     Pokestops  Length
136 -33.915441  18.426585           -33.91544050000003,18.4265849  [1, 2, 3, 4]       4
176 -33.915143  18.426684   -33.91514305000004,18.426684050000002           [5]       3
180 -33.915143  18.427081   -33.91514305000004,18.427080650000004           [5]       3
179 -33.915143  18.426982   -33.91514305000004,18.426981500000004           [5]       3
  7. Add Locations[0, "Coordinates"] to an array that can be saved to .txt later, which will form our final list of "Clustered" coordinates.
clusters = np.append(clusters, locations.iloc[0, 0:2])

This results in:

           Lat       Long                             Coordinates Pokestops  Length
176 -33.915143  18.426684   -33.91514305000004,18.426684050000002       [5]       3
180 -33.915143  18.427081   -33.91514305000004,18.427080650000004       [5]       3
179 -33.915143  18.426982   -33.91514305000004,18.426981500000004       [5]       3
64  -33.916035  18.427180   -33.91603540000001,18.427179800000005       [0]       3
  8. Repeat the process from steps 4-7 until the Locations df is empty (a rough sketch of this outer loop is shown below).
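For reference, steps 4-7 wrapped into a single outer loop might look roughly like the sketch below. It follows the column layout shown above (Lat, Long, Coordinates, Pokestops, Length); the exact point at which the top row is dropped is an assumption on my part, so treat it as an outline rather than the actual script.

# Rough outline of the repeat loop (steps 4-7), not the original script
clusters = np.empty(0)

while len(locations) > 0:
    # Step 4: put the location with the most nearby Pokestops on top
    locations = locations.sort_values("Length", ascending=False)
    # Step 5: Pokestops claimed by the top row no longer count for the other rows
    seen = set(locations.iloc[0, 3])
    locations["Pokestops"] = [
        stops if i == 0 else [s for s in stops if s not in seen]
        for i, stops in enumerate(locations["Pokestops"])
    ]
    # Step 6: drop rows that have no Pokestops left and refresh the cluster size
    locations = locations[locations["Pokestops"].map(len) > 0].copy()
    locations["Length"] = locations["Pokestops"].map(len)
    # Step 7: record the winner's Lat/Long, then remove it before the next pass
    clusters = np.append(clusters, locations.iloc[0, 0:2])
    locations = locations.iloc[1:]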

This all results in an array containing the coordinates of every point from the Locations dataframe that has Pokestops within 70m, sorted from largest to smallest cluster.

Now for the actual question.

The method I am using in steps 1-3 results in needing to loop a few million times for a small-to-medium dataset.

I believe I can achieve faster times by moving away from the "for" loops and letting Pandas calculate the distances between the two tables "directly" using the geopy.distance.geodesic function.

I am just unsure how to even approach this...

  • How do I get it to iterate through rows without using a for loop?
  • How do I keep using my "lists/arrays" in my locations["Pokestops"] column?
  • Will it even be faster?

I know there is a library called GeoPandas, but this requires conda, and would mean I need to step away from being able to use my arrays/lists in the column Locations["Pokestops"]. (I also have zero knowledge of how to use GeoPandas, to be fair.)

I know very broad questions like this are generally shunned, but I am fully self-taught in Python, trying to achieve what is most likely too complicated a script for my level.

I've made it this far; I just need this last step to make it more efficient. The script is fully working and provides the required results, it simply takes too long to run due to the nested for loops.

Any advice/ideas are greatly appreciated, and keep in mind my knowledge of Python/Pandas is somewhat limited and I do not know all the functions/terminology.

EDIT #1:

Thank you @Finn, although this solution has caused me to significantly alter my main body, this is working as intended.

With the new matrix, I am filtering everything > 0.07 to be NaN.
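Roughly speaking, that filtering step looks like this (a sketch only: the DataFrame name dist and the wrapping of the distance matrix are mine, not part of the original code):

# Wrap the Location-by-Pokestop distance matrix and blank out anything beyond 70 m (0.07 km)
dist = pd.DataFrame(distances)
dist = dist.where(dist <= 0.07)   # values > 0.07 become NaN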

          Lat       Long  Count   0         1   2         3         4
82  -33.904620  18.402612      5 NaN       NaN NaN  0.052401       NaN
75  -33.904620  18.400183      5 NaN       NaN NaN       NaN  0.053687
120 -33.903579  18.401224      5 NaN       NaN NaN       NaN       NaN
68  -33.904967  18.402612      5 NaN  0.044402 NaN  0.015147       NaN
147 -33.902885  18.400877      5 NaN       NaN NaN       NaN       NaN
89  -33.904273  18.400183      5 NaN       NaN NaN       NaN       NaN
182 -33.901844  18.398448      4 NaN       NaN NaN       NaN       NaN
54  -33.905314  18.402612      4 NaN  0.020793 NaN  0.026215       NaN
183 -33.901844  18.398795      4 NaN       NaN NaN       NaN       NaN
184 -33.901844  18.399142      4 NaN       NaN NaN       NaN       NaN

The problem I face now is step 5 in my original post.

Can you advise how I would go about removing all columns that do NOT contain NaN in the 1st row?

The only info I can find is for removing columns if ANY value in any row is not NaN. I have tried every combination of .dropna() I could find online.
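For concreteness, the operation being described would be something along these lines (a sketch, assuming the numbered distance columns live in a DataFrame called dist; that name is mine, not from the post):

# Columns whose first-row value is NOT NaN are Pokestops already claimed by the top location
claimed = dist.columns[dist.iloc[0].notna()]
dist = dist.drop(columns=claimed)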

The apply function might be helpful. It applies the specified function along the rows or columns of the DataFrame (and you can control this and other parameters). Check the documentation ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html ) for further understanding.

I do believe loops will be very chaotic to implement once the solution hides in multiple layers. From the perspective of working with datasets, it is far better to avoid a loop-based approach and use apply-style functions, as here we are expected to provide quick solutions.
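As an illustration only (the column names below are made up, not from the question), a row-wise apply looks like this:

import pandas as pd

df = pd.DataFrame({"Lat": [-33.9164, -33.9159], "Long": [18.4262, 18.4266]})

# axis=1 hands each row to the function as a Series
df["Coordinates"] = df.apply(lambda row: f"{row['Lat']},{row['Long']}", axis=1)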

I don't fully understand your code, but from your text there are a few things you can do to speed this up. I think the most beneficial thing to do is to vectorize your distance calculations, because looping over every combination takes forever. So I stole the calculation from this answer and adapted it to create a matrix:

import numpy as np
import pandas as pd

pokestops = pd.read_csv('Pokestops.txt', header=None)
pokestops.columns = ["Lat", "Long"]
locations = pd.read_csv('Locations.txt')
locations.columns = ["Lat", "Long"]

def haversine_np_matrix(lon1, lat1, lon2, lat2):
    # Vectorized haversine (great-circle) distance between every pair of points
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # Broadcast so the result is a matrix: rows follow (lon2, lat2), columns follow (lon1, lat1)
    lon1 = np.expand_dims(lon1, axis=0)
    lat1 = np.expand_dims(lat1, axis=0)
    lon2 = np.expand_dims(lon2, axis=1)
    lat2 = np.expand_dims(lat2, axis=1)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # approximate Earth radius in km
    return km

# Rows = locations, columns = pokestops, values = distance in km
distances = haversine_np_matrix(pokestops["Long"], pokestops["Lat"], locations["Long"], locations['Lat'])

This gives you the distance from each Location to each Pokestop. Now you can use something like distances < 0.07 to find all that are closer than 70 meters. For this to work I stripped Locations.txt of everything but Long and Lat. I am not sure what meters = 10 and degrees = 0.000009915 in your text do, so you may have to adapt the calculation, and you may want to compare the 6367 km radius to the calculation from geopy.distance.geodesic as described here to get the same results.
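As a possible next step (my own sketch, not part of the answer above), the boolean mask can be turned back into the per-location lists of Pokestop indices used in the question:

# Rows of `distances` follow `locations`, columns follow `pokestops`
within_70m = distances < 0.07
locations["Pokestops"] = [list(np.flatnonzero(row)) for row in within_70m]
locations["Length"] = locations["Pokestops"].map(len)
locations = locations[locations["Length"] > 0]   # drop locations with no nearby stops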
