How to speed up Pandas apply function to create a new column in the dataframe?
In my pandas dataframe, I have a column which contains user locations. I have created a function to identify the country from the location, and I want to create a new column with the country name. The function is:
from geopy.geocoders import Nominatim
import numpy as np
import pycountry

geolocator = Nominatim()

def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                loc = geolocator.geocode(str(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name
On passing any location name, the function will return the name of the country. For example,
do_fuzzy_search("Bombay")
returns 'India'.
I simply want to create a new column using the apply function.
df['country'] = df.user_location.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
But it's taking forever to run. I have tried a few techniques mentioned in other questions posted on Stack Overflow and in blogs on the same theme, like Performance of Pandas apply vs np.vectorize, Optimizing Pandas Code for Speed, Speed up pandas using dask or swifter, and Speed up pandas using cudf.
The time taken to execute just the first 10 rows of the column using various techniques is as follows:
%%time
attractions.User_loc[:10].apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 27 ms, sys: 1.18 ms, total: 28.2 ms
Wall time: 6.59 s
0 United States of America
1 NaN
2 Australia
3 India
4 NaN
5 Australia
6 India
7 India
8 United Kingdom
9 Singapore
Name: User_loc, dtype: object
Using the Swifter library:
%%time
attractions.User_loc[:10].swifter.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 1.03 s, sys: 17.9 ms, total: 1.04 s
Wall time: 7.94 s
0 United States of America
1 NaN
2 Australia
3 India
4 NaN
5 Australia
6 India
7 India
8 United Kingdom
9 Singapore
Name: User_loc, dtype: object
Using np.vectorize:
%%time
np.vectorize(do_fuzzy_search)(attractions['User_loc'][:10])
CPU times: user 34.3 ms, sys: 3.13 ms, total: 37.4 ms
Wall time: 9.05 s
array(['United States of America', 'Italia', 'Australia', 'India',
'Italia', 'Australia', 'India', 'India', 'United Kingdom',
'Singapore'], dtype='<U24')
Also, I used Dask's map_partitions, which did not give much performance gain over the apply function.
import dask.dataframe as dd
import multiprocessing
dd.from_pandas(attractions.User_loc, npartitions=4*multiprocessing.cpu_count())\
.map_partitions(lambda df: df.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)).compute(scheduler='processes')
The computation time for 10 rows is more than 5 seconds for each technique. It's taking forever for 100k rows. I also tried to implement cudf, but that crashes my Colab notebook.
What can I do to improve the performance and achieve the result in a reasonable time?
In most cases, an .apply() is slow because it's calling some trivially parallelizable function once per row of a dataframe, but in your case, you're calling an external API. As such, network access and API rate limiting are likely to be the primary factors determining runtime. Unfortunately, that means there's not an awful lot you can do other than wait.
You might be able to benefit by decorating do_fuzzy_search with functools.lru_cache if some elements are frequently repeated, since that will allow the function to skip the API call whenever the location is found in the cache.
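To illustrate the effect, here is a sketch with a fake geocoder (the lookup table is made up for illustration) that counts how often the underlying call actually runs:

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def fake_geocode(location):
    # stand-in for the real API call; counts how often it really executes
    calls["n"] += 1
    return {"Bombay": "India", "Delhi": "India"}.get(location, "Unknown")

for loc in ["Bombay", "Bombay", "Bombay", "Delhi"]:
    fake_geocode(loc)

print(calls["n"])  # 2 -- only the distinct locations hit the "API"
```

With 100k rows but only a few thousand distinct locations, that is the difference between 100k API round trips and a few thousand.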
This looks like an IO-bound issue, not a CPU-bound one. Multiprocessing would not help. The major bottleneck is your call to Nominatim(). You make an HTTP request to their API for every non-NaN value. This means that if 'India' appears in 5 rows, you will make 5 calls for India, which wastefully returns the same geolocation for 5 rows.
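One way to avoid those repeated calls is to resolve each distinct location once and then broadcast the results back onto the column. A sketch, with a hypothetical resolve_country lookup in place of the real geocoder:

```python
import pandas as pd

def resolve_country(location):
    # hypothetical stand-in for the real geocoding call
    return {"Bombay": "India", "Sydney": "Australia"}.get(location)

df = pd.DataFrame({"user_location": ["Bombay", "Sydney", "Bombay", None, "Bombay"]})

# one call per distinct location instead of one per non-null row
country_of = {loc: resolve_country(loc)
              for loc in df["user_location"].dropna().unique()}
df["country"] = df["user_location"].map(country_of)

print(len(country_of))  # 2 lookups for 4 non-null rows
```

The same idea works unchanged with do_fuzzy_search in place of resolve_country.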
Optimising this requires a mixture of caching the most frequent locations locally and making only the few genuinely new calls at runtime. Call Nominatim() on the most frequent locations once, up front, and save the result as a lookup dict/json, e.g. location_geo = df.set_index('location').to_dict()['geolocation'].
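Building that lookup file could look like the sketch below. The sample locations and the resolved country names are made up for illustration; a real run would call the geocoder for each of the top locations instead:

```python
import json
import pandas as pd

# hypothetical column of user locations
locs = pd.Series(["Bombay", "Bombay", "Sydney", "Bombay", "London", "Sydney"])

# pick the locations that appear most often; these are worth pre-resolving
top_locations = locs.value_counts().head(2).index.tolist()
print(top_locations)  # ['Bombay', 'Sydney']

# pretend we resolved them (a real run would call the geocoder here)
resolved = {"Bombay": "India", "Sydney": "Australia"}
with open("our_save_freq_location.json", "w") as f:
    json.dump({k: resolved[k] for k in top_locations}, f)
```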
In the end you would have something like this:
import json
from functools import lru_cache

import numpy as np
import pycountry
from geopy.geocoders import Nominatim

geolocator = Nominatim()

# load the most frequent locations, pre-resolved earlier
with open('our_save_freq_location.json', 'r') as f:
    location_geolocation = json.load(f)

@lru_cache(maxsize=None)
def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                # look first in our dictionary; only call Nominatim on a miss
                if location in location_geolocation:
                    return location_geolocation[location]
                loc = geolocator.geocode(str(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name