What is the most efficient way to apply multiprocessing to unique categories of entries in a pandas dataframe?
I have a large dataset (tsv) that looks something like this:
category lat lon
apple 34.578967 120.232453
apple 34.234646 120.535667
pear 32.564566 120.453567
peach 33.564567 121.456445
apple 34.656757 120.423566
The overall goal is to pass a dataframe containing all records for a single category to DBSCAN to generate cluster labels, and to do this for all categories using the multiprocessing module. I can get this to work, but I'm currently reloading the entire dataset within each process in order to subset to the category, because I keep getting errors when I try to reference the entire dataset as a global variable. The code looks like this:
import pandas as pd
from sklearn.cluster import DBSCAN
import multiprocessing as mp
def findClusters(inCat):
    inTSV = r"C:\trees.csv"
    clDF = pd.read_csv(inTSV, sep='\t')
    # Subset to the rows for this category
    catDF = clDF[clDF['category'] == inCat]
    kms = 0.05
    scaleDist = 0.01 * kms
    x = 'lon'
    y = 'lat'
    dbscan = DBSCAN(eps=scaleDist, min_samples=5)
    clusters = dbscan.fit_predict(catDF[[x, y]])
    catDF['cluster'] = clusters
    catDF.to_csv(r"C:\%s.csv" % inCat)
    del catDF
if __name__ == "__main__":
    inTSV = r"C:\trees.csv"
    df = pd.read_csv(inTSV, sep='\t')
    catList = list(df.category.unique())
    cores = mp.cpu_count()
    pool = mp.Pool(cores - 1)
    pool.map(findClusters, catList)
    pool.close()
    pool.join()
I know this isn't the most efficient way to do this, as I am rereading the data and also writing out intermediate files. I want to run the clustering of each category of data in parallel. Can I build a list of dataframes (one per category) to feed the multiprocessing pool? How would these all be collected after processing (wrapped in a concat call?)? Is there a better way to load the data into memory once and have each process access it to slice out the category data it needs, and if so, how?
Running Anaconda, Python 3.5.5.
Thanks for any insight.
You can use df.groupby, so note:
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
category lat lon
0 apple 34.578967 120.232453
1 apple 34.234646 120.535667
2 pear 32.564566 120.453567
3 peach 33.564567 121.456445
4 apple 34.656757 120.423566
In [4]: list(df.groupby('category'))
Out[4]:
[('apple', category lat lon
0 apple 34.578967 120.232453
1 apple 34.234646 120.535667
4 apple 34.656757 120.423566),
('peach', category lat lon
3 peach 33.564567 121.456445),
('pear', category lat lon
2 pear 32.564566 120.453567)]
And just re-write your function to expect a pair, something like:
def find_clusters(grouped):
    # grouped is a (category, sub-DataFrame) pair yielded by df.groupby
    cat, catDF = grouped
    kms = 0.05
    scale_dist = 0.01 * kms
    x = 'lon'
    y = 'lat'
    dbscan = DBSCAN(eps=scale_dist, min_samples=5)
    clusters = dbscan.fit_predict(catDF[[x, y]])
    catDF['cluster'] = clusters
    catDF.to_csv(r"C:\%s.csv" % cat)
Honestly, writing to intermediate files is fine, I think.
If not, you can always just do:
return catDF
Instead of:
catDF.to_csv(r"C:\%s.csv" % (cat))
And then:
clustered = pd.concat(pool.map(find_clusters, df.groupby('category')))