
How do I parallelize .apply in pandas on strings?

I realize this question might have been asked before, but I didn't find a solution that works specifically for strings and is relatively simple.

I have a data frame with a column containing zip codes, and I use a remote API to fetch details about each zip code. I'm trying to parallelize the data fetching so it runs in multiple threads.

A simple example:

def get_cities_by_zip_code(zip):
    resp = requests.post(geo_svc_url, json={'query': """query GetZipCodeInformation($zip: Float!) {
  zipCode(zip: $zip) {
    ....
  }
}""", 'variables': {'zip': zip}})

    return resp.json()['data']['zipCode']


def location_options(df):
  resp = get_cities_by_zip_code(df['Zip code'])

  if resp is not None:
    df['City'] = resp['preferredName']
    df['Population'] = next((x for x in resp['places'] if x['type'] == 'city'), {'population': 'n/a'})['population']

  return df

def make_df():
  # A function that generates the initial data frame
  ...


df = make_df()

Then I have to apply location_options to df in parallel. I tried a couple of solutions to achieve that. For example:

  1. Via multiprocessing
num_partitions = 20  # number of partitions to split the dataframe
num_cores = 8  # number of cores on your machine

import numpy as np
import pandas as pd
from multiprocessing import Pool

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df


df = parallelize_dataframe(df, location_options)

It doesn't work (this is not the full stack trace):

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
TypeError: Object of type Series is not JSON serializable
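The traceback hints at the cause: pool.map hands each worker an entire DataFrame chunk, so inside location_options, df['Zip code'] is a whole Series, which requests then fails to JSON-serialize. A minimal sketch of one possible fix is to have each worker apply the function row-wise over its chunk. The fake_lookup function below is a hypothetical stand-in for the real API call so the sketch runs offline, and it uses a thread pool rather than a process pool on the assumption that the work is IO-bound (this also sidesteps pickling issues):

```python
import pandas as pd
from multiprocessing.pool import ThreadPool  # threads suit IO-bound API calls

# Hypothetical stand-in for the real API call so the sketch runs offline.
def fake_lookup(zip_code):
    return {'preferredName': f'City-{zip_code}'}

def location_options(row):
    resp = fake_lookup(row['Zip code'])  # row['Zip code'] is now a scalar
    if resp is not None:
        row['City'] = resp['preferredName']
    return row

def apply_rowwise(chunk):
    # Each worker receives a whole DataFrame chunk; apply the row-wise
    # function here instead of passing the chunk to it directly.
    return chunk.apply(location_options, axis=1)

def parallelize_dataframe(df, func, num_partitions=4, num_workers=4):
    size = -(-len(df) // num_partitions)  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with ThreadPool(num_workers) as pool:
        return pd.concat(pool.map(func, chunks))

zips = pd.DataFrame({'Zip code': ['10001', '94105', '60601', '73301']})
result = parallelize_dataframe(zips, apply_rowwise)
```

The key change is the apply_rowwise wrapper: the mapped function operates on a chunk, while location_options still sees one row at a time, as it does under df.apply(..., axis=1).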
  2. Via swifter - it runs, but for some reason it doesn't work with strings and uses only one thread.

Whereas something as simple as

df = df.apply(location_options, axis=1)

works just fine, but it's single-threaded.
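Since the slow part is the network call rather than any computation, one simpler alternative (not from the original post) is to parallelize only the API calls with concurrent.futures.ThreadPoolExecutor and map over the zip-code column directly. The fetch function below is a hypothetical stand-in for get_cities_by_zip_code so the sketch runs offline:

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for get_cities_by_zip_code so the sketch runs offline.
def fetch(zip_code):
    return {'preferredName': f'City-{zip_code}'}

frame = pd.DataFrame({'Zip code': ['10001', '94105', '60601']})

# executor.map preserves input order, so responses line up with the rows.
with ThreadPoolExecutor(max_workers=8) as executor:
    responses = list(executor.map(fetch, frame['Zip code']))

frame['City'] = [r['preferredName'] if r is not None else None for r in responses]
```

This keeps the DataFrame itself single-threaded and only fans out the IO, which avoids both the pickling error above and the per-chunk partitioning logic.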

I may have found a solution in a related post.

This one worked for me, while the others didn't. I also had to do this: https://github.com/darkskyapp/forecast-ruby/issues/13
