Applying a custom function to a pandas DataFrame

Question

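I have a data frame, df_input1, with 10M rows. One of its columns is "geolocations". For every record, I need to find the state name from the geolocation and fill the "State" column of another data frame, df_final. To do this, I wrote a function, convert_to_state, and use it as shown below: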

 df_final['State']  = df_input1['geolocations'].apply(convert_to_state)

Is there a faster way to do this? The current approach takes a lot of time.

Sample data: df_input1

vehicle-no start                end                   geolocations
123        10/12/2019 09:00:12  10/12/2019 11:00:78   fghdrf3245@bafd
456        12/10/2019 06:09:12  10/10/2019 09:23:12   {098ddc76yhfbdb7877]

Custom function:

import reverse_geocoder as rg 
import polyline
def convert_to_state(geoloc):
    long_lat = polyline.decode(geoloc)[0]     
    state_name= rg.search(long_lat)[0]["admin1"]
    return state_name

Answer 1

I suggest using numpy to vectorize the function:

import numpy as np
import pandas as pd
import reverse_geocoder as rg 
import polyline
def convert_to_state(geoloc):
    long_lat = polyline.decode(geoloc)[0]     
    state_name= rg.search(long_lat)[0]["admin1"]
    return state_name


convert_to_state = np.vectorize(convert_to_state) # vectorize the method

col = df_input1['geolocations'].values # A numpy array of the column
df_final['State']  = pd.Series(convert_to_state(col))

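Running the vectorized function on the numpy array can be faster than Series.apply, and you then convert the result back to a pandas Series. Keep in mind that np.vectorize still calls the Python function once per element (the NumPy documentation describes it as essentially a for loop, provided for convenience rather than performance), so any gain is likely to be modest.

I strongly recommend timing both this method and the plain .apply method with the %timeit magic on a smaller subset and comparing the run times.

Here is a very simple example: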
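
One detail worth checking when converting the result back to a pandas Series is index alignment. The sketch below is only an illustration under assumptions: the frames df_src and df_out are hypothetical stand-ins for df_input1 and df_final, and foo_split replaces the real geocoding function. It shows that a bare pd.Series gets a fresh 0..n-1 index, which matters if the source frame is not indexed that way.

```python
import numpy as np
import pandas as pd

def foo_split(t):
    # Toy stand-in for convert_to_state: keep the text before the first dot.
    return t.split(".")[0]

# Hypothetical source frame whose index is not 0..n-1 (for example after filtering).
df_src = pd.DataFrame(
    {"C": ["Some.Text", "More.Text", "Other.Text"]},
    index=[10, 11, 12],
)
df_out = pd.DataFrame(index=df_src.index)

# otypes=[object] keeps the results as plain Python strings.
foo_split_vect = np.vectorize(foo_split, otypes=[object])
values = foo_split_vect(df_src["C"].to_numpy())

# A bare pd.Series is labelled 0, 1, 2, so none of the labels 10..12 match
# and the assigned column is filled with NaN.
df_out["misaligned"] = pd.Series(values)

# Reusing the source index keeps every result on its original row.
df_out["aligned"] = pd.Series(values, index=df_src.index)

print(df_out)
```

If both frames already use the default index, the two assignments give the same result and the extra index argument is harmless.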

In [1]: import pandas as pd                               

In [2]: import numpy as np                                

In [3]: x = pd.DataFrame( 
   ...:     [ 
   ...:         [1,2,"Some.Text"], 
   ...:         [3,4,"More.Text"] 
   ...:     ], 
   ...:     columns = ["A","B", "C"] 
   ...: )                                                 

In [4]: x                                                 
Out[4]: 
   A  B          C
0  1  2  Some.Text
1  3  4  More.Text

In [5]: def foo_split(t): 
   ...:     return t.split(".")[0] 
   ...:                                                   

In [6]: %timeit y = x.C.apply(foo_split)                  
248 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: c = x.C.values # numpy array of the column        

In [8]: foo_split_vect = np.vectorize(foo_split)          

In [9]: %timeit z = pd.Series(foo_split_vect(c))          
159 µs ± 624 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As you can see, in this case the vectorized version is roughly 1.5 times faster (159 µs versus 248 µs).
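
The same comparison can be run outside IPython with the standard timeit module. This is a minimal sketch, assuming a made-up sample of 2,000 short strings in place of a subset of the real column; the names sample, run_apply and run_vectorize are hypothetical, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

def foo_split(t):
    return t.split(".")[0]

# Hypothetical sample standing in for a small subset of the real column.
sample = pd.Series([f"Some{i}.Text" for i in range(2_000)])

# otypes=[object] keeps the results as plain Python strings.
foo_split_vect = np.vectorize(foo_split, otypes=[object])

def run_apply():
    return sample.apply(foo_split)

def run_vectorize():
    return pd.Series(foo_split_vect(sample.to_numpy()), index=sample.index)

# Take the best of 5 repeats, each repeat making 10 calls.
calls = 10
t_apply = min(timeit.repeat(run_apply, number=calls, repeat=5)) / calls
t_vect = min(timeit.repeat(run_vectorize, number=calls, repeat=5)) / calls

print(f"apply:     {t_apply * 1e3:.3f} ms per call")
print(f"vectorize: {t_vect * 1e3:.3f} ms per call")
```

Measuring on a subset that resembles the real data is more informative than a two-row frame, where fixed overhead dominates.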

Answer 2

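Since the routine is essentially a pure function (processing one row does not depend on any other row), we can process the rows in parallel to make it run faster.

You can use the following:

Command prompt: pip install swifter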

import swifter
df_final['State']  = df_input1['geolocations'].swifter.apply(convert_to_state)
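
To illustrate the idea of splitting independent rows across workers, here is a rough sketch that uses only the standard library's concurrent.futures together with pandas. It is not how swifter works internally. The helper parallel_apply and the stand-in function fake_convert_to_state are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def _apply_chunk(chunk, func):
    # Each worker runs an ordinary Series.apply on its own slice.
    return chunk.apply(func)

def parallel_apply(series, func, n_chunks=4, executor_cls=ThreadPoolExecutor):
    """Apply func to every element of series, one chunk per worker."""
    if series.empty:
        return series.apply(func)
    # Split by position so the original index labels travel with each chunk.
    bounds = np.linspace(0, len(series), n_chunks + 1, dtype=int)
    chunks = [series.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    chunks = [c for c in chunks if len(c) > 0]
    with executor_cls(max_workers=len(chunks)) as pool:
        # Executor.map preserves input order, so concat restores the row order.
        parts = list(pool.map(_apply_chunk, chunks, [func] * len(chunks)))
    return pd.concat(parts)

def fake_convert_to_state(geoloc):
    # Hypothetical stand-in for convert_to_state; the real one would decode
    # the polyline and look up the state.
    return geoloc.upper()

geolocations = pd.Series(["ab", "cd", "ef", "gh", "ij"], index=[5, 6, 7, 8, 9])
states = parallel_apply(geolocations, fake_convert_to_state, n_chunks=2)
print(states)
```

Threads are used here only to keep the sketch simple. For CPU-bound pure-Python work they are limited by the GIL, so in practice you would likely pass executor_cls=ProcessPoolExecutor, which requires func to be a picklable top-level function and the call to sit under an `if __name__ == "__main__":` guard on platforms that spawn new processes.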

2 answers

Answer 1
1 accepted 2020-03-26 11:27:10

Answer 2
0 2020-03-26 11:15:48
