How to get the position of records grouped by one column and sorted by another in a pandas DataFrame
I have a very large DataFrame with ~100M rows that looks like this:
query score1 score2 key
0 query0 97.149704 1.317513 key1
1 query1 86.344880 1.337784 key2
2 query2 85.192480 1.312714 key3
3 query1 86.240326 1.317513 key4
4 query2 85.192480 1.312714 key5
...
I want to group the DataFrame by "query" and then get the position of each row within its group when sorted by "score1" and by "score2" (higher is better), so the output should look like this:
query score1 score2 key pos1 pos2
0 query0 97.149704 1.317513 key1 0 0
1 query1 86.344880 1.237784 key2 0 1
2 query2 85.192480 1.312714 key3 1 0
3 query1 86.240326 1.317513 key4 1 0
4 query2 85.492410 1.212714 key5 0 1
Currently, I have a function that looks something like this:
def func(query, df, score1=True):
    mini_df = df[df["query"] == query]
    mini_df.reset_index(drop=True, inplace=True)
    col_name = "pos_score2"
    if score1:
        col_name = "pos_score1"
    mini_df[col_name] = mini_df.index
    return mini_df
which I call from main():
from itertools import repeat
from multiprocessing import Pool, cpu_count
import pandas as pd

p = Pool(cpu_count())
df_list = list(p.starmap(func, zip(queries, repeat(df))))
df = pd.concat(df_list, ignore_index=True)
but it takes a long time. I am running this on a machine with 96 Intel Xeon CPUs and 512 GB of memory, and it still takes more than 24 hours. What would be a much faster way to achieve this?
Use groupby and rank:
df[['pos1', 'pos2']] = (df.groupby('query')[['score1', 'score2']]
                          .rank(method='max', ascending=False)
                          .sub(1).astype(int))
print(df)
# Output
query score1 score2 key pos1 pos2
0 query0 97.149704 1.317513 key1 0 0
1 query1 86.344880 1.237784 key2 0 1
2 query2 85.192480 1.312714 key3 1 0
3 query1 86.240326 1.317513 key4 1 0
4 query2 85.492410 1.212714 key5 0 1
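For anyone who wants to try this locally, here is a minimal, self-contained sketch of the same groupby/rank approach. The sample rows are copied from the expected-output table in the question; the helper name add_positions and the small tied frame are hypothetical and only there for illustration. It also assumes the score columns contain no NaN values, since astype(int) would fail on missing ranks.

```python
import pandas as pd

# Sample rows copied from the expected-output table in the question.
df = pd.DataFrame({
    "query": ["query0", "query1", "query2", "query1", "query2"],
    "score1": [97.149704, 86.344880, 85.192480, 86.240326, 85.492410],
    "score2": [1.317513, 1.237784, 1.312714, 1.317513, 1.212714],
    "key": ["key1", "key2", "key3", "key4", "key5"],
})


def add_positions(frame, tie_method="max"):
    """Hypothetical helper: return a copy with 0-based per-query positions.

    The highest score in each query group gets position 0.
    Assumes score1 and score2 have no NaN values.
    """
    ranks = (frame.groupby("query")[["score1", "score2"]]
                  .rank(method=tie_method, ascending=False)
                  .sub(1)          # rank starts at 1, positions start at 0
                  .astype(int))
    out = frame.copy()
    out["pos1"] = ranks["score1"]
    out["pos2"] = ranks["score2"]
    return out


result = add_positions(df)
print(result)

# Illustrative frame with a tie on score1 inside one query group.
tied = pd.DataFrame({
    "query": ["q", "q", "q"],
    "score1": [2.0, 2.0, 1.0],
    "score2": [3.0, 2.0, 1.0],
    "key": ["a", "b", "c"],
})
print(add_positions(tied, tie_method="max"))    # tied rows share a position
print(add_positions(tied, tie_method="first"))  # every row gets a distinct position
```

The whole computation is a single vectorized pass over the frame, whereas the original function filters the full DataFrame once per query. Note that with method='max' rows with equal scores share the same position; if every row in a group needs a distinct position, method='first' breaks ties by order of appearance.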