
What is the most efficient way to apply multiprocessing to unique categories of entries in a pandas dataframe?

I have a large dataset (tsv) that looks something like this:

category     lat         lon  
apple        34.578967   120.232453  
apple        34.234646   120.535667  
pear         32.564566   120.453567  
peach        33.564567   121.456445  
apple        34.656757   120.423566  

The overall goal is to pass a dataframe containing all records for a single category to DBSCAN to generate cluster labels, and to do this for every category using the multiprocessing module. I can get this to work, but I'm currently reloading the entire dataset within each process in order to subset it to the category, because I keep getting errors when attempting to reference the entire dataset as a global variable. Code looks like so:

import pandas as pd
from sklearn.cluster import DBSCAN
import multiprocessing as mp

def findClusters(inCat):
    inTSV = r"C:\trees.csv"
    clDF = pd.read_csv(inTSV, sep='\t')
    # Compare against the inCat argument (not the literal string 'inCat');
    # .copy() avoids SettingWithCopyWarning when the cluster column is added.
    catDF = clDF[clDF['category'] == inCat].copy()
    kms = 0.05
    scaleDist = 0.01*kms
    x = 'lon'
    y = 'lat'
    dbscan = DBSCAN(eps=scaleDist, min_samples=5)
    clusters = dbscan.fit_predict(catDF[[x,y]])
    catDF['cluster'] = clusters
    catDF.to_csv(r"C:\%s.csv" % (inCat))
    del catDF

if __name__ == "__main__":

    inTSV = r"C:\trees.csv"
    df = pd.read_csv(inTSV, sep='\t')

    catList = list(df.category.unique())

    cores = mp.cpu_count()
    pool = mp.Pool(cores - 1)
    pool.map(findClusters, catList)
    pool.close()
    pool.join()

I know this isn't the most efficient way to do this, as I am rereading the data and also writing out intermediate files. I want to run the clustering of each category of data in parallel. Can I build a list of dataframes (one per category) that feeds the multiprocessing pool? How would these all be collected after processing (wrapped in a concat call?)? Is there a better way to load the data into memory once and have each process access it to slice out the category data it needs, and if so, how?

Running Anaconda, Python 3.5.5

Thanks for any insight.

You can use df.groupby. For example:

In [1]: import pandas as pd

In [2]: df = pd.read_clipboard()

In [3]: df
Out[3]:
  category        lat         lon
0    apple  34.578967  120.232453
1    apple  34.234646  120.535667
2     pear  32.564566  120.453567
3    peach  33.564567  121.456445
4    apple  34.656757  120.423566

In [4]: list(df.groupby('category'))
Out[4]:
[('apple',   category        lat         lon
  0    apple  34.578967  120.232453
  1    apple  34.234646  120.535667
  4    apple  34.656757  120.423566),
 ('peach',   category        lat         lon
  3    peach  33.564567  121.456445),
 ('pear',   category        lat         lon
  2     pear  32.564566  120.453567)]

And just rewrite your function to expect a pair, something like:

def find_clusters(grouped):
    cat, catDF = grouped
    kms = 0.05
    scale_dist = 0.01*kms
    x = 'lon'
    y = 'lat'
    dbscan = DBSCAN(eps=scale_dist, min_samples=5)
    clusters = dbscan.fit_predict(catDF[[x,y]])
    catDF['cluster'] = clusters
    catDF.to_csv(r"C:\%s.csv" % (cat))

Honestly, writing to intermediate files is fine, I think.

If not, you can always just do:

return catDF

Instead of

catDF.to_csv(r"C:\%s.csv" % (cat))

And then:

df = pd.concat(pool.map(find_clusters, df.groupby('category')))
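
Spelled out, a minimal sketch of that return-and-concat variant (the output filename is a hypothetical stand-in):

if __name__ == "__main__":

    df = pd.read_csv(r"C:\trees.csv", sep='\t')

    pool = mp.Pool(mp.cpu_count() - 1)
    # Each worker returns its labelled sub-dataframe (the return catDF
    # version of find_clusters); concat stitches them back into a
    # single frame in the parent process.
    clustered = pd.concat(pool.map(find_clusters, df.groupby('category')))
    pool.close()
    pool.join()

    clustered.to_csv(r"C:\trees_clustered.csv")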
