Pandas dataframe conditional inner join with itself

I am looking for a way to inner join a column of a dataframe with itself, based on a condition. I have a large dataframe with two columns, 'Group' and 'Person'. I would like to create a second dataframe that has one entry for every pair of persons who have been in the same group. First dataframe:

    Group | Person
    a1    | p1
    a1    | p2
    a1    | p3
    a1    | p4
    a2    | p1

Output:

    Person1 | Person2 | Weight
    p1      | p2      | 1
    p1      | p3      | 1
    p1      | p4      | 1
    p2      | p3      | 1
    p2      | p4      | 1
    p3      | p4      | 1

The weight increases if a pair of persons appears together in more than one group. So far, I have a naive implementation based on a sub-dataframe and two for loops. Is there a more elegant and, more importantly, a faster or built-in way to do this?

My implementation so far:

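    # i: row index, presumably from an enclosing loop (not shown)
    group = principals.iloc[i, 0]

    # all rows that belong to the same group
    sub = principals.loc[principals['Group'] == group]

    for j in range(len(sub) - 1):
        for k in range(j + 1, len(sub)):
            # check if tuple exists -> update or create new entry
            ...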

I was wondering whether there is functionality similar to an SQL inner join, joining on the condition that the group is the same and then matching person against person. I could take care of the duplicate p1|p1 entries in that case...

Many thanks in advance.

combinations (from itertools) will give you the pairs you are looking for. Once you have those, you can explode the tuples of combinations into rows. The weight is then the number of rows in which each pair occurs, i.e. the number of groups it appears in; in this case it is 1 because every pair exists in only one group.
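To see what the first step produces on its own, here is a small sketch of itertools.combinations applied to the members of a single group. The list members is a made-up stand-in for the Person values of group a1. Sorting before pairing is an assumption about what is wanted: combinations emits pairs in input order, so without it (p1, p2) and (p2, p1) could be counted as different pairs when row order differs between groups.

```python
from itertools import combinations

# Hypothetical stand-in for the Person values of group a1.
members = ['p1', 'p2', 'p3', 'p4']

# Every unordered pair, built from the sorted members so that the same
# two people always form the same tuple regardless of row order.
pairs = list(combinations(sorted(members), 2))
print(pairs)

# A group with a single member yields no pairs at all.
single = list(combinations(['p1'], 2))
print(single)
```

The empty result for a one-person group is why a group like a2 contributes nothing to the final table.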

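    import pandas as pd
    from itertools import combinations

    df = pd.DataFrame({'Group': ['a1', 'a1', 'a1', 'a1', 'a2'],
                       'Person': ['p1', 'p2', 'p3', 'p4', 'p1']})

    # One row per pair per group; groups with a single member
    # produce no pairs and are removed by dropna().
    df = (
        df.groupby('Group')['Person']
          .apply(lambda x: tuple(combinations(x, 2)))
          .explode()
          .dropna()
          .reset_index()
    )

    # Weight = number of groups in which the pair occurs.
    df['Weight'] = df.groupby('Person')['Group'].transform('size')
    df[['Person1', 'Person2']] = df['Person'].apply(pd.Series)

    # Keep one row per pair even if it occurs in several groups.
    df = df[['Person1', 'Person2', 'Weight']].drop_duplicates()

    print(df)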

Output:

      Person1 Person2  Weight
    0      p1      p2       1
    1      p1      p3       1
    2      p1      p4       1
    3      p2      p3       1
    4      p2      p4       1
    5      p3      p4       1
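The SQL-style inner join the question asks about can also be sketched directly with DataFrame.merge. This is an alternative, not the approach used above, and it rests on two assumptions: the sample data is made up (p1 and p2 deliberately share two groups so that a weight above 1 shows up), and each (Group, Person) row occurs only once, since repeated rows would inflate the counts.

```python
import pandas as pd

# Made-up data: p1 and p2 share two groups, so their weight should be 2.
df = pd.DataFrame({
    'Group':  ['a1', 'a1', 'a1', 'a2', 'a2'],
    'Person': ['p1', 'p2', 'p3', 'p2', 'p1'],
})

# Inner join of the frame with itself on Group: every person is paired
# with every person of the same group (including themselves).
joined = df.merge(df, on='Group', suffixes=('1', '2'))

# Keep each unordered pair once and drop the p1|p1 self-pairs.
joined = joined[joined['Person1'] < joined['Person2']]

# Weight = number of groups the pair shares.
result = (
    joined.groupby(['Person1', 'Person2'])
          .size()
          .reset_index(name='Weight')
)
print(result)
```

The Person1 < Person2 filter removes the self-pairs and makes the pair order canonical in one step. The intermediate join grows with the square of each group's size, so for very large groups the combinations approach may use less memory.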
