Improving efficiency compared with iterating over two large Pandas DataFrames

Question

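I have two huge Pandas DataFrames with location-based values, and I need to update df1['count'] with the number of records from df2 that are less than 1000 meters from each point in df1.

Here is an example of my data, as imported into Pandas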

df1 =       lat      long    valA   count
        0   123.456  986.54  1      0
        1   223.456  886.54  2      0
        2   323.456  786.54  3      0
        3   423.456  686.54  2      0
        4   523.456  586.54  1      0

df2 =       lat      long    valB
        0   123.456  986.54  1
        1   223.456  886.54  2
        2   323.456  786.54  3
        3   423.456  686.54  2
        4   523.456  586.54  1

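In reality, df1 has about 10 million rows and df2 has about 1 million rows.

I created a working nested FOR loop using the Pandas DF.itertuples() method. It works fine on smaller test data sets (df1 = 1k rows and df2 = 100 rows takes about an hour to complete), but the full data set is vastly larger and, by my calculations, would take years to complete. Here is my working code...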

import pandas as pd
import geopy.distance as gpd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

df1.sort_values(['long', 'lat'], inplace=True)
df2.sort_values(['long', 'lat'], inplace=True)

for irow in df1.itertuples():
    count = 0
    indexLst = []
    Location1 = (irow[1], irow[2])

    for jrow in df2.itertuples():
        Location2 = (jrow[1], jrow[2])
        if gpd.distance(Location1, Location2).kilometers < 1:
            count += 1
            indexLst.append(jrow[0])
    if count > 0:                  #only update DF if a match is found
        df1.at[irow[0],'count'] = (count)
        df2.drop(indexLst, inplace=True)       #drop rows already counted from df2 to speed up next iteration

#save updated df1 to new csv file
outFileName = 'combined.csv'
df1.to_csv(outFileName, sep=',', index=False)

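Since the points in df1 are evenly spaced, each point in df2 only needs to be counted once. To that end, I added a drop statement that removes rows from df2 once they have been counted, in the hope of improving iteration time. I originally tried to write a merge/join statement instead of the nested loops, but was unsuccessful.

At this stage, any help with improving the efficiency is greatly appreciated!

Edit: The goal is to update the 'count' column in df1 (as shown below) with the number of points from df2 that are <1 km away, and to output the result to a new file.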

df1 =       lat      long    valA   count
        0   123.456  986.54  1      3
        1   223.456  886.54  2      1
        2   323.456  786.54  3      9
        3   423.456  686.54  2      2
        4   523.456  586.54  1      5

Answer 1

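Having done this kind of thing frequently, I've found a few best practices:

1) Use numpy and numba as much as possible

2) Leverage parallelization as much as possible

3) Skip loops in favor of vectorized code (we use a loop with numba here to take advantage of parallelization).

In this particular case, I want to point out the slowdown introduced by geopy. While it is a great package and produces quite accurate distances (compared to the Haversine method), it is much slower (I have not looked at the implementation to see why).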

import numpy as np
from geopy import distance

origin = (np.random.uniform(-90,90), np.random.uniform(-180,180))
dest = (np.random.uniform(-90,90), np.random.uniform(-180,180))

%timeit distance.distance(origin, dest)

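216 µs ± 363 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

At that rate, calculating 10 million x 1 million distances will take approximately 2,160,000,000 seconds, or 600k hours. Even parallelism will only help so much.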
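The estimate is plain arithmetic on the measured time per call. A quick sketch that reproduces it (the variable names are only illustrative; the row counts come from the question and the 216 µs figure from the timing above):

```python
# Back-of-the-envelope estimate for the brute-force geopy approach.
n_df1 = 10_000_000          # rows in df1 (from the question)
n_df2 = 1_000_000           # rows in df2 (from the question)
seconds_per_call = 216e-6   # measured geopy timing per distance call

total_pairs = n_df1 * n_df2
total_seconds = total_pairs * seconds_per_call
total_hours = total_seconds / 3600

print(f"{total_pairs:.1e} pairs, {total_seconds:,.0f} s, {total_hours:,.0f} h")
```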

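Because you are only interested in points that are very close together, I'd suggest using the Haversine distance (which is less accurate at greater distances).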
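For reference, the Haversine formula for two points with latitudes φ1, φ2 and longitudes λ1, λ2 (all in radians) on a sphere of radius R can be written as follows. The symbols are illustrative notation rather than anything from the original post; the function that follows evaluates the same expression with R = 6373 km.

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
  + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
```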

from numba import jit, prange, vectorize

@vectorize
def haversine(s_lat,s_lng,e_lat,e_lng):

    # approximate radius of earth in km
    R = 6373.0

    s_lat = s_lat*np.pi/180.0                      
    s_lng = np.deg2rad(s_lng)     
    e_lat = np.deg2rad(e_lat)                       
    e_lng = np.deg2rad(e_lng)  

    d = np.sin((e_lat - s_lat)/2)**2 + np.cos(s_lat)*np.cos(e_lat) * np.sin((e_lng - s_lng)/2)**2

    return 2 * R * np.arcsin(np.sqrt(d))

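%timeit haversine(origin[0], origin[1], dest[0], dest[1])

1.85 µs ± 53.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That's already a 100-fold improvement. But we can do better. You may have noticed the @vectorize decorator I added from numba. It turns the previously scalar Haversine function into a vectorized one that accepts vectors as inputs. We'll leverage this in the next step: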

@jit(nopython=True, parallel=True)
def get_nearby_count(coords, coords2, max_dist):
    '''
    Input: `coords`: List of coordinates, lat-lngs in an n x 2 array
           `coords2`: Second list of coordinates, lat-lngs in an k x 2 array
           `max_dist`: Max distance to be considered nearby
    Output: Array of length n with a count of coords nearby coords2
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        x, y = coords[i]
        # returns an array of length k
        dist = haversine(x, y, coords2[:,0], coords2[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output

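Hopefully, you'll now have an array whose length equals that of the first set of coordinates (10 million in your case). You can then simply assign it to your data frame as the count!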
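As a small illustration of that last step, here is a sketch of the assignment on made-up data. `count_nearby_numpy` is a hypothetical NumPy-only stand-in for `get_nearby_count`, used here so the example runs without numba; the column names follow the question, and the sample coordinates are invented.

```python
import numpy as np
import pandas as pd

def count_nearby_numpy(coords, coords2, max_dist_km, radius_km=6373.0):
    """Hypothetical NumPy-only stand-in for get_nearby_count (no numba)."""
    lat1 = np.deg2rad(coords[:, 0])[:, None]   # shape (n, 1)
    lng1 = np.deg2rad(coords[:, 1])[:, None]
    lat2 = np.deg2rad(coords2[:, 0])[None, :]  # shape (1, k)
    lng2 = np.deg2rad(coords2[:, 1])[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    dist = 2 * radius_km * np.arcsin(np.sqrt(a))  # (n, k) distances in km
    return (dist < max_dist_km).sum(axis=1)

# Tiny made-up frames with the column names from the question.
df1 = pd.DataFrame({"lat": [10.0, 20.0], "long": [30.0, 40.0],
                    "valA": [1, 2], "count": 0})
df2 = pd.DataFrame({"lat": [10.001, 10.002, 20.5], "long": [30.001, 30.0, 40.0],
                    "valB": [1, 2, 3]})

counts = count_nearby_numpy(df1[["lat", "long"]].to_numpy(),
                            df2[["lat", "long"]].to_numpy(), 1.0)
df1["count"] = counts.astype(int)  # cast, since a float array may come back
print(df1)
```

The stand-in builds a full (n, k) distance matrix, so it is only suitable for small inputs. With the numba version the assignment line stays the same, and the cast matters because `np.zeros(n)` produces a float array.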

Timing test at 100,000 x 10,000:

n = 100_000
k = 10_000

coords1 = np.zeros((n, 2))
coords2 = np.zeros((k, 2))

coords1[:,0] = np.random.uniform(-90, 90, n)
coords1[:,1] = np.random.uniform(-180, 180, n)
coords2[:,0] = np.random.uniform(-90, 90, k)
coords2[:,1] = np.random.uniform(-180, 180, k)

%timeit get_nearby_count(coords1, coords2, 1.0)

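2.45 s ± 73.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, that still means you'll be looking at something upwards of 20,000 seconds for the full data set. And this was on a machine with 80 cores (utilizing 76ish, based on top usage).

That's the best I can do for now, good luck (also, first post, thanks for inspiring me to contribute!)

PS: You might also look into Dask arrays and the function map_blocks() to parallelize this function (instead of relying on prange). How you partition the data may influence total execution time.

PPS: 1,000,000 x 100,000 (100x smaller than your full set) took 3 min 27 s (207 seconds), so the scaling appears to be linear and a bit forgiving.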
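To make the scaling argument concrete, the sketch below turns the two benchmark timings quoted in this answer into a throughput figure and extrapolates to the full 10 million x 1 million problem. It assumes run time grows in proportion to the number of point pairs, which is an assumption rather than something measured at full size.

```python
# Throughput implied by the two benchmarks quoted in this answer, extrapolated
# to the full problem. Assumes run time scales with the number of pairs.
benchmarks = {
    "100,000 x 10,000": (100_000 * 10_000, 2.45),
    "1,000,000 x 100,000": (1_000_000 * 100_000, 207.0),
}
full_pairs = 10_000_000 * 1_000_000

estimates = {}
for label, (pairs, seconds) in benchmarks.items():
    pairs_per_second = pairs / seconds
    estimates[label] = full_pairs / pairs_per_second  # seconds for full set
    print(f"{label}: {pairs_per_second:.2e} pairs/s -> "
          f"{estimates[label] / 3600:.1f} h for the full set")
```

Under that assumption, both benchmarks point to a run of several hours on the 80-core machine described in this answer.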

PPPS: Implementation with a simple latitude difference filter:

@jit(nopython=True, parallel=True)
def get_nearby_count_vlat(coords, coords2, max_dist):
    '''
    Input: `coords`: List of coordinates, lat-lngs in an n x 2 array
           `coords2`: List of port coordinates, lat-lngs in an k x 2 array
           `max_dist`: Max distance to be considered nearby
    Output: Array of length n with a count of coords nearby coords2
    '''
    # initialize
    n = coords.shape[0]
    k = coords2.shape[0]
    coords2_abs = np.abs(coords2)
    output = np.zeros(n)

    # prange is a parallel loop when operations are independent
    for i in prange(n):
        # comparing a point in coords to the arrays in coords2
        point = coords[i]
        # subsetting coords2 to reduce haversine calc time. Value .02 is from playing with Gmaps and will need to change for max_dist > 1.0
        coords2_filtered = coords2[np.abs(point[0] - coords2[:,0]) < .02]
        # in case of no matches
        if coords2_filtered.shape[0] == 0: continue
        # returns an array with one distance per remaining (filtered) point
        dist = haversine(point[0], point[1], coords2_filtered[:,0], coords2_filtered[:,1])
        # sum the boolean of distances less than the max allowable
        output[i] = np.sum(dist < max_dist)

    return output
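The .02 threshold in the filter was chosen by hand and, as the comment notes, has to change with max_dist. One possible way to derive it instead, sketched here under the same spherical-Earth assumption (R = 6373 km): a great-circle distance is never shorter than the north-south distance between the two latitudes, so any point whose latitude differs by more than max_dist / R radians can be skipped safely. The helper names are hypothetical, and sorting the latitudes so that a binary search can replace the boolean mask is an addition that is not part of the code above.

```python
import numpy as np

def lat_band_degrees(max_dist_km, radius_km=6373.0):
    """Latitude difference (degrees) spanned by max_dist_km along a meridian."""
    return np.rad2deg(max_dist_km / radius_km)

def lat_band_slice(sorted_lats, lat, band_deg):
    """Index range of sorted_lats that lies within band_deg of lat."""
    lo = np.searchsorted(sorted_lats, lat - band_deg, side="left")
    hi = np.searchsorted(sorted_lats, lat + band_deg, side="right")
    return lo, hi

band = lat_band_degrees(1.0)  # threshold for max_dist = 1 km
# Made-up latitudes, sorted once up front so each lookup is a binary search.
lats = np.sort(np.array([10.5, 9.0, 10.004, 10.0, 9.995, 10.02]))
lo, hi = lat_band_slice(lats, 10.0, band)
print(band, lats[lo:hi])
```

For 1 km the derived band is a little under 0.009 degrees, so the hand-picked .02 is on the conservative side. To use the slice in the counting loop, coords2 would have to be sorted by latitude beforehand.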

Answer 2

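I've done something similar lately, though not with lat, lon, and I only had to find the nearest point and its distance. For this, I used scipy.spatial.cKDTree. It was quite fast.

I think that in your case, you could use the query_ball_point() function.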

from scipy import spatial
import pandas as pd

file1 = 'C:\\path\\file1.csv'    
file2 = 'C:\\path\\file2.csv' 

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
# Build the index
tree = spatial.cKDTree(df1[['long', 'lat']])
# Then query the index

You should give it a try.
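A minimal sketch of how the query step might look for this problem, under a few assumptions that are not in the answer above: the tree is built on the df2 points (so each df1 point can be queried against it), the Earth is treated as a sphere of radius 6373 km, and latitude/longitude are first converted to 3D points on the unit sphere so that the Euclidean radius used by the tree corresponds to a real distance. The helper names `to_unit_xyz` and `count_within_km` are hypothetical.

```python
import numpy as np
from scipy import spatial

EARTH_RADIUS_KM = 6373.0  # assumed spherical Earth, same value as in Answer 1

def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to points on the unit sphere."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def count_within_km(lat1, lon1, lat2, lon2, max_dist_km):
    """For each point in set 1, count the points of set 2 within max_dist_km."""
    tree = spatial.cKDTree(to_unit_xyz(lat2, lon2))  # index the second set
    # A great-circle distance d corresponds to a chord of 2*sin(d / (2R))
    # on the unit sphere, which is the radius the Euclidean tree needs.
    chord = 2.0 * np.sin(max_dist_km / (2.0 * EARTH_RADIUS_KM))
    neighbours = tree.query_ball_point(to_unit_xyz(lat1, lon1), r=chord)
    return np.array([len(idx) for idx in neighbours])

# Made-up points: two query points and four candidate points.
counts = count_within_km([0.0, 45.0], [0.0, 90.0],
                         [0.0, 0.0, 0.001, 45.0], [0.005, 0.02, 0.001, 90.0],
                         max_dist_km=1.0)
print(counts)
```

With the data frames from the question, the call would take the form `df1['count'] = count_within_km(df1['lat'], df1['long'], df2['lat'], df2['long'], 1.0)`. Note that query_ball_point includes points at exactly the given radius, whereas the question uses a strict "less than" comparison.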
