
How to speed up Pandas apply function to create a new column in the dataframe?

In my pandas dataframe, I have a column that contains the user's location. I have created a function to identify the country from the location, and I want to create a new column with the country name. The function is:

from geopy.geocoders import Nominatim
import numpy as np
import pycountry

# geopy 2.x requires an explicit user_agent; any descriptive string works
geolocator = Nominatim(user_agent="country-lookup")

def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                loc = geolocator.geocode(str(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name

On passing any location name, the function will return the name of the country. For example:

do_fuzzy_search("Bombay") returns 'India' . do_fuzzy_search("Bombay")返回'India'

I simply want to create a new column using the apply function:

df['country'] = df.user_location.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)

But it's taking forever to run. I have tried a few techniques mentioned in other questions posted on Stack Overflow and in blog posts on the same theme, like Performance of Pandas apply vs np.vectorize, Optimizing Pandas Code for Speed, Speed up pandas using dask or swift, and Speed up pandas using cudf.

The time taken to execute just the first 10 rows of the column using the various techniques is as follows:

%%time
attractions.User_loc[:10].apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 27 ms, sys: 1.18 ms, total: 28.2 ms
Wall time: 6.59 s
0    United States of America
1                         NaN
2                   Australia
3                       India
4                         NaN
5                   Australia
6                       India
7                       India
8              United Kingdom
9                   Singapore
Name: User_loc, dtype: object

Using the Swifter library:

%%time
attractions.User_loc[:10].swifter.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 1.03 s, sys: 17.9 ms, total: 1.04 s
Wall time: 7.94 s
0    United States of America
1                         NaN
2                   Australia
3                       India
4                         NaN
5                   Australia
6                       India
7                       India
8              United Kingdom
9                   Singapore
Name: User_loc, dtype: object

Using np.vectorize:

%%time
np.vectorize(do_fuzzy_search)(attractions['User_loc'][:10])
CPU times: user 34.3 ms, sys: 3.13 ms, total: 37.4 ms
Wall time: 9.05 s
array(['United States of America', 'Italia', 'Australia', 'India',
       'Italia', 'Australia', 'India', 'India', 'United Kingdom',
       'Singapore'], dtype='<U24')

I also used Dask's map_partitions, which did not give much of a performance gain over the apply function.

import dask.dataframe as dd
import multiprocessing

dd.from_pandas(attractions.User_loc, npartitions=4*multiprocessing.cpu_count())\
   .map_partitions(lambda df: df.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)).compute(scheduler='processes')

The computation time for 10 rows is more than 5 seconds for each technique, so it would take forever for 100k rows. I also tried to implement cudf, but that crashes my Colab notebook.

What can I do to improve the performance and get the result in a reasonable time?

In most cases, an .apply() is slow because it calls some trivially parallelizable function once per row of a dataframe, but in your case you are calling an external API. As such, network access and API rate limiting are likely to be the primary factors determining runtime. Unfortunately, that means there's not an awful lot you can do other than wait.

You might benefit from decorating do_fuzzy_search with functools.lru_cache if some elements are frequently repeated, since that lets the function skip the API call whenever the location is already in the cache.
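A minimal sketch of that idea (the wrapper name is arbitrary; df and do_fuzzy_search are as defined in the question):

from functools import lru_cache
import pandas as pd

# Memoize results per distinct location so repeated values skip the API call
@lru_cache(maxsize=None)
def cached_fuzzy_search(location):
    return do_fuzzy_search(location)

df['country'] = df.user_location.apply(
    lambda loc: cached_fuzzy_search(loc) if pd.notnull(loc) else loc
)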

This looks like an IO-bound issue, not a CPU-bound one, so multiprocessing would not help. The major bottleneck is your call to Nominatim(): you make an HTTP request to their API for every non-NaN value. This means that if 'India' appears in 5 rows, you will make 5 calls for India, which wastefully returns the same geolocation 5 times.
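To illustrate how much of that work is redundant, here is a sketch (not part of this answer) that resolves each distinct location only once and then maps the results back onto the column:

# Resolve each distinct, non-null location exactly once...
unique_locs = attractions.User_loc.dropna().unique()
country_by_loc = {loc: do_fuzzy_search(loc) for loc in unique_locs}

# ...then broadcast the results back onto the full column
attractions['country'] = attractions.User_loc.map(country_by_loc)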

Optimising this requires a mix of caching the most frequent locations locally and only making new calls for locations that are not already cached:

  1. Create a DataFrame with the N most frequent locations.
  2. Call Nominatim() on those most frequent locations and save the result as a lookup dict/json, e.g. location_geo = df.set_index('location').to_dict()['geolocation']
  3. Save it, e.g. with json.dump(...) (see the sketch after this list).
  4. In your function, first check whether the location is in the cached location_geo dictionary and return that value; only if it is missing, make a call to the Nominatim API.
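A rough sketch of steps 1-3, under some assumptions: the locations live in attractions.User_loc, the cut-off N is arbitrary, and the cache stores the resolved country name (a plain string, so it survives json.dump); the file name matches the one loaded in the code below.

import json

N = 1000  # hypothetical cut-off for "most frequent"

# 1. The N most frequent non-null locations
top_locations = attractions.User_loc.dropna().value_counts().head(N).index

# 2. Resolve each of them once up front
location_geo = {loc: do_fuzzy_search(loc) for loc in top_locations}

# 3. Persist the lookup so later runs can reuse it
with open('our_save_freq_location.json', 'w') as f:
    json.dump(location_geo, f)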

In the end you would have something like this:

import json
from functools import lru_cache

import numpy as np
import pycountry
from geopy.geocoders import Nominatim

# geopy 2.x requires an explicit user_agent; any descriptive string works
geolocator = Nominatim(user_agent="country-lookup")

# Load the pre-computed lookup (location -> country name) for the most frequent locations
with open('our_save_freq_location.json', 'r') as f:
    location_geolocation = json.load(f)

@lru_cache(maxsize=None)
def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                # Look in the local cache first; only call Nominatim on a miss
                if location in location_geolocation:
                    return location_geolocation[location]
                loc = geolocator.geocode(str(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name
