How can I optimize this code using apply? (Iterrows)

So I have the following dataframes (simplified):

    df1 = propslat    prosplong    type
          50          45           prosp1
          34          -25          prosp2


    df2 = complat     complong     type
          29          58           competitor1
          68          34           competitor2

I want to do the following: for each individual prospect (740k prospects in total), run a distance calculation between that prospect and every competitor, so in theory the output would look like this:

    df3 = d_p(x)_to_c1         d_p(x)_to_c2      d_p(x)_to_c3
          234.34                895.34            324.5

where every row of the output is a new prospect.

My current code is the following:

    prospectsarray=[]

    prosparr = []



    for i, row in prospcords.iterrows():
        lat1 = row['prosplat']
        lon2 = row['prosplong']
        coords= [lat1,lon2]
        distancearr2 = []

        for x, row2 in compcords.iterrows():
            lat2 = row2['complat']
            lon2 = row2['complong']
            coords2 = [lat2,lon2]
            distance = geopy.distance.distance(coords, coords2).miles
            if distance > 300:
                distance = 0

            distancearr2.append(distance)
        prosparr.append(distancearr2)
    prospectsarray.extend(prosparr)
    dfprosp = pd.DataFrame(prospectsarray)

While this accomplishes my goal, it is horrendously slow.

I tried the following optimization, but the output is not iterating, and I am still using iterrows, which is what I was trying to avoid.

    competitorlist = []
    def distancecalc(df):
        distance_list = []
        for i in range(0, len(prospcords)):
            coords2 = [prospcords.iloc[i]['prosplat'],prospcords.iloc[i]['prosplong']]
            d = geopy.distance.distance(coords1,coords2).miles
            print(d)
            if d>300:
                d=0
            distance_list.append(d)
        competitorlist.append(distance_list)




    for x, row2 in compcords.iterrows():
        lat2 = row2['complat']
        lon2 = row2['complong']
        coords1 = [lat2,lon2]
        distancecalc(prospcords)
        print(distance_list)

My guess is that most of the execution time is spent in geopy.distance.distance(). You can confirm this by using cProfile or some other timing tool.
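One way to check that guess is to run the nested loop under cProfile and see where the cumulative time goes. The sketch below is only an illustration: `placeholder_distance`, `all_distances`, and `profile_call` are hypothetical names, and the placeholder stands in for geopy.distance.distance(...).miles so that the example runs without geopy. Swap in the real call to profile the actual code.

```python
import cProfile
import io
import math
import pstats


def placeholder_distance(p, q):
    # Hypothetical stand-in for geopy.distance.distance(p, q).miles,
    # used only so this sketch runs without geopy installed.
    return math.hypot(p[0] - q[0], p[1] - q[1])


def all_distances(prospects, competitors):
    # Same nested-loop shape as the question: one row per prospect.
    rows = []
    for p in prospects:
        rows.append([placeholder_distance(p, c) for c in competitors])
    return rows


def profile_call(func, *args, top=10):
    # Run func under cProfile and return its result plus a text report
    # sorted by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


prospects = [(50.0, 45.0), (34.0, -25.0)] * 500
competitors = [(29.0, 58.0), (68.0, 34.0)]
rows, report = profile_call(all_distances, prospects, competitors)
print(report)
```

If the distance function dominates the cumulative-time column, the loop machinery is not the main cost, and replacing iterrows alone will not help much.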

According to the geopy documentation on distance, it calculates the geodesic distance between two points using an ellipsoidal model of the Earth. The algorithm appears to be very accurate: the documentation compares it to a deprecated algorithm that is "only accurate to 0.2 mm". My guess is that computing the geodesic distance is somewhat time-consuming.

They also have a function great_circle (geopy.distance.great_circle), which uses a spherical model of the Earth. Because the Earth is not a true sphere, this will have "an error of up to about 0.5%." So, if the actual distance is 100 (miles/km), it could be off by as much as half a mile/km. Again, just guessing, but I suspect this algorithm is faster than the geodesic algorithm.

If you can tolerate the potential errors in your application, try using great_circle() instead of distance().
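To give a feel for what a spherical-model calculation involves, here is a minimal sketch using the haversine formula. It is not geopy's implementation: the function names are hypothetical, and `EARTH_RADIUS_MILES` is an assumed mean radius that may differ slightly from the constant geopy uses.

```python
import math

# Assumed mean Earth radius in miles; geopy's own constant may differ slightly.
EARTH_RADIUS_MILES = 3958.8


def great_circle_miles(p, q):
    # Haversine formula on a sphere; p and q are (latitude, longitude) in degrees.
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above 1 near antipodal points.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def distances_within(prospect, competitors, cutoff=300):
    # Mirror the question's rule: distances above the cutoff are stored as 0.
    out = []
    for c in competitors:
        d = great_circle_miles(prospect, c)
        out.append(d if d <= cutoff else 0)
    return out


result = distances_within((50.0, 45.0), [(29.0, 58.0), (68.0, 34.0), (51.0, 45.0)])
print(result)
```

The formula needs only a handful of trigonometric calls per pair, which is why a spherical model is plausibly cheaper than an iterative ellipsoidal one.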

First of all, you should be careful about the information you're giving. The dataframe column names you give are not compatible with your code. A few explanations of what you are trying to do would also be helpful.

Anyway, here is my solution:

    import pandas as pd
    from geopy import distance

    compCords = pd.DataFrame(
        {'compLat': [20.0, 13.0, 14.0], 'compLong': [-15.0, 5.0, -1.2]})
    prospCords = pd.DataFrame(
        {'prospLat': [21.0, 12.1, 13.0], 'prospLong': [-14.0, 2.2, 2.0]})


    def distanceCalc(compCoord):
        # Return the distances as a Series instead of appending to a global list
        propsDist = prospCords.apply(
            lambda row: distance.distance(
                compCoord, [row['prospLat'], row['prospLong']]).miles,
            axis=1)
        # Zero out distances beyond 300 miles
        return propsDist.apply(lambda d: 0. if d > 300 else d)


    # Here too, the result comes back through the return value.
    # Each row of compDist is one competitor and each column one prospect;
    # use compDist.T for one row per prospect.
    compDist = compCords.apply(lambda row: distanceCalc(
        [row['compLat'], row['compLong']]), axis=1)

    dfProsp = pd.DataFrame(compDist)

Note: your problem is that when you use things like apply and functions, you should think in a "functional" way: pass most of what you need through the inputs and outputs of your functions, and do not use tricks like appending elements to global list variables through append or extend. Those are "side effects", and side effects do not get along well with functional-programming concepts like apply (or 'map', as it is usually called in functional programming).
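The difference can be seen in a small sketch that uses plain Python's map rather than pandas. The names `clip_far`, `clip_and_store`, and `collected` are hypothetical and chosen only for illustration.

```python
def clip_far(distance, cutoff=300):
    # Pure function: the result depends only on the inputs.
    return 0.0 if distance > cutoff else distance


# Side-effect style: the helper mutates a global list and returns nothing,
# so map (or apply) has no values of its own to collect.
collected = []


def clip_and_store(distance):
    collected.append(clip_far(distance))


raw = [120.5, 450.0, 299.9]
side_effect_results = list(map(clip_and_store, raw))
# Functional style: the helper returns its result, and map gathers them.
functional_results = list(map(clip_far, raw))

print(side_effect_results)
print(functional_results)
```

In the side-effect version the mapped result is a list of None values and the real data sits in a global, which is the pattern the second attempt in the question falls into.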

Here is the fastest solution I could make!

    import numpy as np
    import pandas as pd
    from geopy.distance import great_circle

    # df (not shown) is the combined table: rows 0-232 are competitors,
    # rows 234 onward are customers.
    compuid = np.array(df.iloc[0:233, 0])
    complat = np.array(df.iloc[0:233, 3])
    complong = np.array(df.iloc[0:233, 4])
    custlat = np.array(df.iloc[234:, 3])
    custlong = np.array(df.iloc[234:, 4])


    ppmmasterlist = []
    mergedlist = []
    for x, y in np.nditer([custlat, custlong]):
        # x, y are the latitude and longitude of the current customer
        coords1 = [x, y]
        # Per-customer lists: the distance collection, the weights greater
        # than 0, and the pipeline lists
        distcoll = []
        listGreaterThan0 = []
        ppmlist = []
        ppmdlist = []
        z = 0  # index of the current competitor
        for p, q in np.nditer([complat, complong]):
            # p, q are the latitude and longitude of the current competitor
            coords2 = [p, q]
            distance = great_circle(coords1, coords2).miles
            if distance >= 300:
                distance = 0
                di = 0
            elif distance < 300:
                di = (300 - distance) / 300
                distcoll.append(distance)
                distcoll.append(compuid[z])
            if di > 0:
                listGreaterThan0.append(di)
                listGreaterThan0.append(compuid[z])
            if z >= 220:
                ppmlist.append(di)
                ppmdlist.append(distance)
            z += 1
        sumval = [sum(ppmlist)]
        # Every other entry of listGreaterThan0 is a weight (the rest are IDs)
        sumval1 = [sum(listGreaterThan0[::2])]
        mergedlist = ppmlist + sumval + ppmdlist + sumval1 + listGreaterThan0
        mergedlist.extend(distcoll)
        # print(mergedlist)
        # ppmmasterlist += [mergedlist]
        ppmmasterlist.append(mergedlist)

    df5 = pd.DataFrame(ppmmasterlist)
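The `di` value in the inner loop is a linear weight: 0 at or beyond 300 miles, rising to 1 as the distance approaches 0. A minimal sketch of just that arithmetic, using the hypothetical helper name `proximity_weight`:

```python
def proximity_weight(distance, cutoff=300):
    # Hypothetical helper; same arithmetic as `di` in the loop above:
    # 0 at or beyond the cutoff, rising linearly to 1 at distance 0.
    if distance >= cutoff:
        return 0.0
    return (cutoff - distance) / cutoff


weights = [proximity_weight(d) for d in (0, 75, 150, 300, 450)]
print(weights)
```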
