
How to get the position of records grouped by one column and sorted by another in a pandas DataFrame

I have a very large DataFrame with ~100M rows that looks like this:

    query     score1    score2   key
0  query0  97.149704  1.317513  key1
1  query1  86.344880  1.237784  key2
2  query2  85.192480  1.312714  key3
3  query1  86.240326  1.317513  key4
4  query2  85.492410  1.212714  key5
...

I want to group the DataFrame by "query" and then get the position of each row within its group when sorted by "score1" and by "score2" (higher is better), so the output should look like this:

    query     score1    score2   key  pos1  pos2
0  query0  97.149704  1.317513  key1     0     0
1  query1  86.344880  1.237784  key2     0     1
2  query2  85.192480  1.312714  key3     1     0
3  query1  86.240326  1.317513  key4     1     0
4  query2  85.492410  1.212714  key5     0     1

Currently, I have a function that looks something like this:

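def func(query, df, score1=True):
    # Pick the score column to rank by; the output column is named after it.
    score_col = "score1" if score1 else "score2"
    # Select this query's rows, best score first (the sort step is assumed;
    # the snippet as posted did not show it).
    mini_df = df[df["query"] == query].sort_values(score_col, ascending=False)
    mini_df = mini_df.reset_index(drop=True)
    # After the reset, the index is the row's position within the query.
    mini_df["pos_" + score_col] = mini_df.index
    return mini_df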

which I call from main():

from itertools import repeat
from multiprocessing import Pool, cpu_count
import pandas as pd
p = Pool(cpu_count())
df_list = list(p.starmap(func, zip(queries, repeat(df))))
df = pd.concat(df_list, ignore_index=True)

but it takes a long time. I am running this on a machine with 96 Intel Xeon CPUs and 512 GB of memory, and it still takes more than 24 hours. What would be a much faster way to achieve this?

Use groupby and rank, which compute the positions for every group in one pass instead of filtering the whole DataFrame once per query:

df[['pos1', 'pos2']] = (df.groupby('query')[['score1', 'score2']]
                          .rank(method='max', ascending=False)
                          .sub(1).astype(int))
print(df)

# Output
    query     score1    score2   key  pos1  pos2
0  query0  97.149704  1.317513  key1     0     0
1  query1  86.344880  1.237784  key2     0     1
2  query2  85.192480  1.312714  key3     1     0
3  query1  86.240326  1.317513  key4     1     0
4  query2  85.492410  1.212714  key5     0     1
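
For anyone who wants to try this end to end, here is a minimal self-contained sketch built from the five sample rows in the expected output. It differs from the snippet above in one respect, which is my own assumption about what "position" should mean: it uses method='first' so that tied scores within a query still get distinct positions (ties are broken by order of appearance), whereas method='max' gives tied rows the same, highest position. It also assumes there are no missing scores, since rank returns NaN for them and astype(int) would then fail.

```python
import pandas as pd

# Sample rows copied from the expected output in the question.
df = pd.DataFrame({
    "query": ["query0", "query1", "query2", "query1", "query2"],
    "score1": [97.149704, 86.344880, 85.192480, 86.240326, 85.492410],
    "score2": [1.317513, 1.237784, 1.312714, 1.317513, 1.212714],
    "key": ["key1", "key2", "key3", "key4", "key5"],
})

# Rank both score columns within each query, highest score first.
# method="first" is an assumption here: it keeps positions unique on ties.
ranks = df.groupby("query")[["score1", "score2"]].rank(
    method="first", ascending=False
)

# Ranks start at 1; subtract 1 to get zero-based positions.
df["pos1"] = ranks["score1"].sub(1).astype(int)
df["pos2"] = ranks["score2"].sub(1).astype(int)
print(df)
```

On this sample there are no ties, so both methods give the same pos1 and pos2 as the output shown above.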
