What is the most efficient way to apply multiprocessing to unique categories of entries in a pandas dataframe?
I have a large dataset (tsv) that looks something like this:
category lat lon
apple 34.578967 120.232453
apple 34.234646 120.535667
pear 32.564566 120.453567
peach 33.564567 121.456445
apple 34.656757 120.423566
The overall goal is to pass a dataframe containing all records for a single category to DBSCAN to generate cluster labels, and to do this for all categories using the multiprocessing module. I can get this to work, but I'm currently reloading the entire dataset within each process in order to subset to the category, because I keep getting errors when I try to reference the entire dataset as a global variable. The code looks like this:
import pandas as pd
from sklearn.cluster import DBSCAN
import multiprocessing as mp
def findClusters(inCat):
    inTSV = r"C:\trees.csv"
    clDF = pd.read_csv(inTSV, sep='\t')
    # Subset to the rows for this category
    catDF = clDF[clDF['category'] == inCat]
    kms = 0.05
    scaleDist = 0.01 * kms
    x = 'lon'
    y = 'lat'
    dbscan = DBSCAN(eps=scaleDist, min_samples=5)
    clusters = dbscan.fit_predict(catDF[[x, y]])
    catDF['cluster'] = clusters
    catDF.to_csv(r"C:\%s.csv" % inCat)
    del catDF
if __name__ == "__main__":
    inTSV = r"C:\trees.csv"
    df = pd.read_csv(inTSV, sep='\t')
    catList = list(df.category.unique())
    cores = mp.cpu_count()
    pool = mp.Pool(cores - 1)
    pool.map(findClusters, catList)
    pool.close()
    pool.join()
I know this isn't the most efficient way to do this, as I am rereading the data and also writing out intermediate files. I want to run the clustering of each category of data in parallel. Can I build a list of dataframes (one per category) to feed the multiprocessing pool? How would these all be collected after processing (wrapped in a concat call?)? Is there a better way to load the data into memory once and have each process access it to slice out the category data it needs, and if so, how?
Running Anaconda, Python 3.5.5.
Thanks for any insight.
You can use df.groupby, so note:
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
category lat lon
0 apple 34.578967 120.232453
1 apple 34.234646 120.535667
2 pear 32.564566 120.453567
3 peach 33.564567 121.456445
4 apple 34.656757 120.423566
In [4]: list(df.groupby('category'))
Out[4]:
[('apple', category lat lon
0 apple 34.578967 120.232453
1 apple 34.234646 120.535667
4 apple 34.656757 120.423566),
('peach', category lat lon
3 peach 33.564567 121.456445),
('pear', category lat lon
2 pear 32.564566 120.453567)]
And just re-write your function to expect a pair, something like:
def find_clusters(grouped):
    # grouped is a (category, sub-DataFrame) pair yielded by df.groupby
    cat, catDF = grouped
    kms = 0.05
    scale_dist = 0.01 * kms
    x = 'lon'
    y = 'lat'
    dbscan = DBSCAN(eps=scale_dist, min_samples=5)
    clusters = dbscan.fit_predict(catDF[[x, y]])
    catDF['cluster'] = clusters
    catDF.to_csv(r"C:\%s.csv" % cat)
Honestly, writing to intermediate files is fine, I think.
If not, you can always just do:
return catDF
Instead of:
catDF.to_csv(r"C:\%s.csv" % (cat))
And then:
clustered = pd.concat(pool.map(find_clusters, df.groupby('category')))