pandas.core.groupby.DataFrameGroupBy.idxmin() is very slow, how can I make my code faster?
I am trying to do the same thing as a SQL group by and take the min value:
select id, min(value), other_fields...
from table
group by id
I tried:
dfg = df.groupby('id', sort=False)
idx = dfg['value'].idxmin()
df = df.loc[idx, list(df.columns.values)]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.idxmin.html
But line 2, the idxmin() call, takes more than half an hour on ~4M columns in df, whereas the group by takes less than 1 second. What am I missing? Is it supposed to take that long? How can I make this process faster? Will it be faster in pure SQL?
Use an alternative with DataFrame.sort_values and DataFrame.drop_duplicates:
df1 = df.sort_values(by=['value']).drop_duplicates('id', keep='first')
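As a quick sanity check, here is a minimal sketch that runs both approaches on a small made-up frame and confirms they select the same rows. The sample data and the `other` column are assumptions for illustration only; `kind="mergesort"` is my addition so that ties on `value` keep their original order, which matches how idxmin picks the first occurrence.

```python
import pandas as pd

# Hypothetical sample data; the "id" and "value" column names follow the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "value": [5, 3, 7, 2, 9],
    "other": ["a", "b", "c", "d", "e"],
})

# Approach from the question: index label of the minimum value in each group.
idx = df.groupby("id", sort=False)["value"].idxmin()
via_idxmin = df.loc[idx]

# Approach from the answer: sort by value, then keep the first row seen for each id.
# kind="mergesort" is a stable sort, so rows with equal values keep their original order.
via_sort = df.sort_values(by="value", kind="mergesort").drop_duplicates("id", keep="first")

# The two results differ only in row order, so compare them after sorting by index.
same = via_idxmin.sort_index().equals(via_sort.sort_index())
print(same)
```

Note that the sort-based result comes back ordered by value rather than by first appearance of each id; call sort_index() on it if the original row order matters.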