How to get the position of a record grouped by one column and ordered by another in a pandas DataFrame

Question

I have a very large DataFrame, about 100M rows, that looks like this:

    query     score1    score2   key
0  query0  97.149704  1.317513  key1
1  query1  86.344880  1.237784  key2
2  query2  85.192480  1.312714  key3
3  query1  86.240326  1.317513  key4
4  query2  85.492410  1.212714  key5
...

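I want to group the DataFrame by "query" and then get the position of each row within its group when ordered by "score1" and by "score2" (higher is better), so the output should look like this: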

    query     score1    score2   key  pos1  pos2
0  query0  97.149704  1.317513  key1     0     0
1  query1  86.344880  1.237784  key2     0     1
2  query2  85.192480  1.312714  key3     1     0
3  query1  86.240326  1.317513  key4     1     0
4  query2  85.492410  1.212714  key5     0     1

Currently, I have a function that looks like this:

def func(query, df, score1=True):
    mini_df = df[df["query"] == query]
    mini_df.reset_index(drop=True, inplace=True)
    col_name = "pos_score2"
    if score1:
        col_name = "pos_score1"
    mini_df[col_name] = mini_df.index
    return mini_df

which I call from main():

from itertools import repeat
from multiprocessing import Pool, cpu_count
import pandas as pd

p = Pool(cpu_count())
df_list = list(p.starmap(func, zip(queries, repeat(df))))

But this takes a very long time. I am running it on a machine with 96 CPUs (Intel Xeon) and 512G of RAM, and it still takes more than 24 hours. What is a faster way to achieve this?

Answer 1

Use groupby and rank:

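# Rank both score columns within each query, highest score first.
# rank() is 1-based, so subtract 1 to get 0-based positions.
df[['pos1', 'pos2']] = (df.groupby('query')[['score1', 'score2']]
                          .rank(method='max', ascending=False)
                          .sub(1).astype(int))
print(df)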

# Output
    query     score1    score2   key  pos1  pos2
0  query0  97.149704  1.317513  key1     0     0
1  query1  86.344880  1.237784  key2     0     1
2  query2  85.192480  1.312714  key3     1     0
3  query1  86.240326  1.317513  key4     1     0
4  query2  85.492410  1.212714  key5     0     1
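
The sample above contains no tied scores within a query, so it does not show how the `method` argument behaves. The following is a minimal sketch, not part of the original answer, that rebuilds the sample rows and then compares tie-breaking methods on a small made-up frame. The helper name `add_positions` and the `tied` frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Sample rows copied from the expected output in the question.
df = pd.DataFrame({
    "query": ["query0", "query1", "query2", "query1", "query2"],
    "score1": [97.149704, 86.344880, 85.192480, 86.240326, 85.492410],
    "score2": [1.317513, 1.237784, 1.312714, 1.317513, 1.212714],
    "key": ["key1", "key2", "key3", "key4", "key5"],
})


def add_positions(frame, method="max"):
    """Hypothetical helper: return a copy with 0-based per-query positions."""
    out = frame.copy()
    # Rank within each query, highest score first (rank 1 = best).
    ranks = out.groupby("query")[["score1", "score2"]].rank(
        method=method, ascending=False
    )
    # Shift from 1-based ranks to 0-based positions.
    # astype(int) assumes the score columns contain no NaN values.
    out["pos1"] = ranks["score1"].sub(1).astype(int)
    out["pos2"] = ranks["score2"].sub(1).astype(int)
    return out


result = add_positions(df)
print(result)

# Made-up frame with a tie on score1, to compare tie-breaking methods.
tied = pd.DataFrame({
    "query": ["q", "q", "q"],
    "score1": [2.0, 2.0, 1.0],
    "score2": [3.0, 2.0, 1.0],
    "key": ["a", "b", "c"],
})
for method in ("max", "min", "first", "dense"):
    print(method, add_positions(tied, method)["pos1"].tolist())
```

With `method='max'`, the two tied rows both receive position 1 and no row in that group receives position 0. If every position within a query should be distinct, `method='first'` may be a better fit; if tied rows should share the best position, `method='min'` or `method='dense'` are the alternatives to consider.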

Solution 1
0 2022-06-18 22:21:55
