
Dask GroupBy performance for nunique is too slow. How to improve the performance?

I have large files, more than 5 GB in size, stored in Parquet format. When I run the groupby operation shown in the code below on a small sample set of 600k+ records, Dask takes more than 6 minutes, whereas pandas takes only 0.4 seconds. Though I understand pandas is faster when the dataset fits in memory, my question is: if I pass the entire Parquet file to a Dask DataFrame, will performance improve?

Please also suggest how to improve the code below so that it runs in a few seconds rather than minutes.

Example: Using a Dask DataFrame

import datetime
import dask.dataframe as dd

StartTime = datetime.datetime.now()
df = dd.read_parquet('201908.parquet', columns=['A', 'B'], engine='pyarrow')
print(len(df))
df = df.set_index('A')
rs = df.groupby('A').B.nunique().nlargest(10).compute(scheduler='processes')
print(rs)
EndTime = datetime.datetime.now()
print("Total Time Taken for processing: " + str(EndTime - StartTime))

Output is:

606995
A
-3868378286825833950    7
 1230391617280615928    7
 381683316762598393     6
-5730635895723403090    5
 903278193888929465     5
 2861437302225712286    5
-9057855329515864244    4
-8963355998258854688    4
-7876321060385968364    4
-6825439721748529898    4
Name: B, dtype: int64
Total Time Taken for processing: 0:06:05.042146

Example using pandas:

import datetime
import pandas as pd

StartTime = datetime.datetime.now()
df = pd.read_parquet('201908.parquet', columns=['A', 'B'], engine='pyarrow')
print(len(df))
df = df.set_index('A')
rs = df.groupby('A').B.nunique().nlargest(10)
print(rs)
EndTime = datetime.datetime.now()
print("Total Time Taken for processing: " + str(EndTime - StartTime))

Output is:

606995
A
-3868378286825833950    7
 1230391617280615928    7
 381683316762598393     6
-5730635895723403090    5
 903278193888929465     5
 2861437302225712286    5
-9057855329515864244    4
-8963355998258854688    4
-7876321060385968364    4
-6825439721748529898    4
Name: B, dtype: int64
Total Time Taken for processing: 0:00:00.419033

I believe there is an open issue for an approximate groupby-nunique algorithm for Dask DataFrame. You might look into that if you're particularly interested. Dask DataFrame's non-groupby nunique algorithm is quite a bit faster.
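In the meantime, a common workaround is to deduplicate the (A, B) pairs first and then do an ordinary count, which parallelizes much better than groupby(...).nunique(). Note also that the set_index('A') call in the question forces a full shuffle of the data and is not needed for the groupby, so dropping it should already help. Below is a minimal sketch (not a definitive fix), assuming the file and column names from the question and that A and B contain no missing values:

import datetime
import dask.dataframe as dd

StartTime = datetime.datetime.now()
df = dd.read_parquet('201908.parquet', columns=['A', 'B'], engine='pyarrow')

# Keep one row per distinct (A, B) pair; counting rows per A then equals
# the number of distinct B values per A (assuming no NaNs in B).
rs = (df.drop_duplicates(subset=['A', 'B'])
        .groupby('A').B.count()
        .nlargest(10)
        .compute(scheduler='processes'))
print(rs)
EndTime = datetime.datetime.now()
print("Total Time Taken for processing: " + str(EndTime - StartTime))

Whether this helps on your data depends on the cardinality of the (A, B) pairs: drop_duplicates shrinks each partition before the final aggregation, so it works best when each A has relatively few distinct B values.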
