
pandas: how to speed up dataframes processing

I am doing some distance processing on two dataframes (100k lines and 1M lines). My processing currently takes 20 days and I would like to see if I can improve my code to speed it up. I used geopandas after a suggestion here, which considerably sped up the sorting in my iteration, but I'm wondering if I could code it differently or follow some best practices.

Here is my code:

    import pandas as pd
    import geopandas as gpd
    from shapely.geometry import Point

    dfb = pd.read_csv(building, sep='#')
    dfb = dfb.astype('string')
    dfo = pd.read_csv(occsol, sep='#')
    dfo = dfo.astype('string')
    dfb['geot'] = 'non'
    gs = gpd.GeoSeries.from_wkt(dfo['geometry'], crs='EPSG:27572')
    gdfo = gpd.GeoDataFrame(dfo, geometry=gs)
    dfb['valeurseuil'] = 3 * ((dfb['surf'] / 3.141592653589793) ** (1 / 2))  # this is a threshold
    m = 0
    fin = len(dfb)
    for i in range(len(dfb)):
        gdfo['dist'] = gdfo['geometry'].distance(Point(dfb.iloc[i]['centro'][0], dfb.iloc[i]['centro'][1]))
        gdfo = gdfo.sort_values(by='dist')
        for j in range(3):  # 3 closest polygons, sorted by ascending distance
            XYPtj = gdfo.iloc[j]['coordpoints']
            compteur = 0
            temp = []
            for pt in XYPtj:
                dist = self.distancepoint([dfb.iloc[i]['centro'], pt])
                temp.append(dist)
            for d in temp:
                if d < dfb.iloc[i]['valeurseuil']:  # threshold
                    compteur += 1
            if compteur >= 2 and gdfo.iloc[j]['surf'] >= 1.5 * dfb.iloc[i]['surf']:
                # remove capacity from the polygon, assuming a 1:1.5 ratio
                gdfo.iat[j, gdfo.columns.get_loc('surf')] -= 1.5 * dfb.iloc[i]['surf']
                dfb.iat[i, dfb.columns.get_loc('geot')] = 'oui'
        m += 1
        print('progress: ' + str(m) + ' / ' + str(fin))
    dfb.to_csv('buildingeot', sep='#', index=False)

def distancepoint(self, xy):
    """Euclidean distance; the coordinate reference system is not taken into account."""
    if self.valeursabs(xy[0][0] - xy[1][0]) < 100000 and self.valeursabs(
            xy[0][1] - xy[1][1]) < 100000:  # check this limit
        d = ((xy[1][1] - xy[0][1]) ** 2 + (xy[1][0] - xy[0][0]) ** 2) ** (1 / 2)
    else:
        d = 666666  # sentinel value for "too far apart"
    return d

Especially for a problem of this size, it's worth looking for vectorized algorithms. And for many-to-many matching problems like this, numpy and scipy offer many algorithms which will outperform pandas groupby or looped options by such a significant margin that the extra effort required to manage the indices yourself is usually worth the hassle.

There are many ways to approach this, but if your goal is simply to find the nearest point using a euclidean approximation, you can't get much simpler than scipy.spatial.cKDTree. The following code finds the positional index in the second dataset (with 1M rows) of the point which is closest to each of the 100k points in the first dataset, and runs in ~20 seconds:

In [1]: import geopandas as gpd, numpy as np, scipy.spatial, pandas as pd

In [2]: # set up GeoDataFrame with 100k random points
   ...: gdf1 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(
   ...:     (np.random.random(size=int(1e5)) * 360 - 180),
   ...:     (np.random.random(size=int(1e5)) * 180 - 90),
   ...: ))

In [3]: # set up GeoDataFrame with 1M random points
   ...: gdf2 = gpd.GeoDataFrame(geometry=gpd.points_from_xy(
   ...:     (np.random.random(size=int(1e6)) * 360 - 180),
   ...:     (np.random.random(size=int(1e6)) * 180 - 90),
   ...: ))


In [4]: %%time
   ...: known_xy = np.stack([gdf2.geometry.x, gdf2.geometry.y], -1)
   ...: tree = scipy.spatial.cKDTree(known_xy)
   ...: 
   ...: query_xy = np.stack([gdf1.geometry.x, gdf1.geometry.y], -1)
   ...: distances, indices = tree.query(query_xy)
   ...:
   ...:
CPU times: user 21 s, sys: 240 ms, total: 21.2 s
Wall time: 22.1 s
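
Once you have distances and indices, mapping the result back onto the first frame is straightforward. Here is a minimal sketch, assuming both frames keep their default RangeIndex (tree.query returns positional indices into gdf2) and using illustrative column names:

In [5]: # attach each query point's nearest neighbour (positional index, distance, geometry)
   ...: gdf1["nearest_idx"] = indices
   ...: gdf1["nearest_dist"] = distances
   ...: gdf1["nearest_geom"] = gdf2.geometry.iloc[indices].values

The distances returned by cKDTree are in the same units as the coordinates, so with the random lon/lat points above they are only a rough euclidean approximation; with data in a metric projection such as your EPSG:27572 they would already be in metres.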
