Python interaction between rows and columns in dataframe

I have a dataframe:

import pandas as pd

df = pd.DataFrame({
    'exam': [
        'French', 'English', 'German', 'Russian', 'Russian',
        'German', 'German', 'French', 'English', 'French'
    ],
    'student': [
        'john', 'ted', 'jason', 'marc', 'peter', 'bob',
        'robert', 'david', 'nik', 'kevin'
    ]
})

print(df)

              exam   student   
    0       French    john     
    1       English   ted        
    2       German    jason         
    3       Russian   marc         
    4       Russian   peter         
    5       German    bob         
    6       German    robert         
    7       French    david         
    8       English   nik          
    9       French    kevin         

Does anybody know how to create a new dataframe containing two columns, "student" and "student shared exam with"?

I should get something like:

                student   shared_exam_with      
        0       john       david                   
        1       john       kevin            
        2       ted        nik                    
        3       jason      bob                 
        4       jason      robert                   
        5       marc       peter              
        6       peter      marc             
        7       bob        jason                    
        8       bob        robert                    
        9       robert     jason                      
       10       robert     bob                   
       11       david      john             
       12       david      kevin                      
       13       nik        ted                     
       14       kevin      john                     
       15       kevin      david                   

For example, John took French, and so did David and Kevin!

Any ideas? Thank you in advance!

self merge

df.merge(
    df, on='exam',
    suffixes=['', '_shared_with']
).query('student != student_shared_with')

       exam student student_shared_with
1    French    john               david
2    French    john               kevin
3    French   david                john
5    French   david               kevin
6    French   kevin                john
7    French   kevin               david
10  English     ted                 nik
11  English     nik                 ted
14   German   jason                 bob
15   German   jason              robert
16   German     bob               jason
18   German     bob              robert
19   German  robert               jason
20   German  robert                 bob
23  Russian    marc               peter
24  Russian   peter                marc
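
To go from this result to the two-column frame asked for in the question, something like the following should work. It is a minimal sketch, assuming the same `df` as in the question; the column name `shared_exam_with` is taken from the question's expected output, and `pairs` is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

pairs = (
    df.merge(df, on='exam', suffixes=('', '_shared_with'))
      .query('student != student_shared_with')   # drop self-pairs
      .drop(columns='exam')                      # keep only the two student columns
      .rename(columns={'student_shared_with': 'shared_exam_with'})
      .reset_index(drop=True)                    # renumber rows from 0
)
print(pairs)
```

Note that the rows come out grouped by exam rather than in the order of the original `student` column, so the row order differs from the listing in the question.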

self join

d1 = df.set_index('exam')
d1.join(
    d1, rsuffix='_shared_with'
).query('student != student_shared_with')

        student student_shared_with
exam                               
English     ted                 nik
English     nik                 ted
French     john               david
French     john               kevin
French    david                john
French    david               kevin
French    kevin                john
French    kevin               david
German    jason                 bob
German    jason              robert
German      bob               jason
German      bob              robert
German   robert               jason
German   robert                 bob
Russian    marc               peter
Russian   peter                marc

itertools.permutations + groupby

from itertools import permutations as perm

cols = ['student', 'student_shared_with']
df.groupby('exam').student.apply(
    lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)

   student student_shared_with
0      ted                 nik
1      nik                 ted
2     john               david
3     john               kevin
4    david                john
5    david               kevin
6    kevin                john
7    kevin               david
8    jason                 bob
9    jason              robert
10     bob               jason
11     bob              robert
12  robert               jason
13  robert                 bob
14    marc               peter
15   peter                marc
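
The same idea can also be sketched without `apply`, by iterating over the groups and collecting the ordered pairs in a plain list. This is only an illustrative variant, not part of the answer above; `rows` and `out` are made-up names, and `df` is assumed to be the frame from the question.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# For every exam, emit each ordered pair of distinct students who took it.
rows = [
    pair
    for _, students in df.groupby('exam')['student']
    for pair in permutations(students, 2)
]
out = pd.DataFrame(rows, columns=['student', 'student_shared_with'])
print(out)
```

Because `permutations(x, 2)` never pairs an element with itself, no extra filtering step is needed here.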

One way would be:

import numpy as np

cross = pd.crosstab(df['student'], df['exam'])
res = cross.dot(cross.T)
res.where(np.triu(res, k=1).astype('bool')).stack()
Out: 
student  student
bob      jason      1.0
         robert     1.0
david    john       1.0
         kevin      1.0
jason    robert     1.0
john     kevin      1.0
marc     peter      1.0
nik      ted        1.0
dtype: float64

The dot product generates a binary matrix of co-occurrences: an entry is 1 when two students took the same exam. In order not to repeat the same pairs, I keep only the upper triangle with where and then stack. The index of the resulting Series holds the pairs of students that have the same exam.
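
To make the intermediate matrix easier to inspect, here is a hedged sketch that builds it step by step and turns the upper triangle into a flat two-column frame. It assumes the `df` from the question; `mask`, `pair_idx`, `names` and `unique_pairs` are illustrative names, and the `> 0` test is an assumption added so the sketch would also cope with students who share more than one exam.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

cross = pd.crosstab(df['student'], df['exam'])  # rows: students, columns: exams
res = cross.dot(cross.T)                        # shared-exam count per student pair

# Strict upper triangle (k=1): skips the diagonal and the mirrored pairs.
mask = np.triu(np.ones(res.shape, dtype=bool), k=1)
pair_idx = np.argwhere(mask & (res.to_numpy() > 0))

names = res.index.to_numpy()
unique_pairs = pd.DataFrame({
    'student': names[pair_idx[:, 0]],
    'shared_exam_with': names[pair_idx[:, 1]],
})
print(unique_pairs)
```

Unlike the merge-based answers, this lists each pair only once (for example bob/jason but not jason/bob), which matches the Series shown above rather than the output requested in the question.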

This would be a one-step process in SQL, but here it takes two: (1) merge the DataFrame with itself on exam, and (2) get rid of the rows where student == student_shared_with (since a student doesn't share an exam with themselves).

df2 = pd.merge(
    df, df, how='outer', on='exam', suffixes=['', '_shared_with']).drop('exam', axis=1)
df2 = df2.loc[df2.student != df2.student_shared_with]

   student student_shared_with
1     john               david
2     john               kevin
3    david                john
5    david               kevin
6    kevin                john
7    kevin               david
10     ted                 nik
11     nik                 ted
14   jason                 bob
15   jason              robert
16     bob               jason
18     bob              robert
19  robert               jason
20  robert                 bob
23    marc               peter
24   peter                marc
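
For comparison, the single SQL step mentioned above might look like the following self join, run here through Python's built-in `sqlite3` module. This is a sketch under assumptions: the table name `exams`, the aliases `a` and `b`, and the variable `sql_pairs` are invented for the illustration, and `df` is the frame from the question.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

conn = sqlite3.connect(':memory:')      # throwaway in-memory database
df.to_sql('exams', conn, index=False)   # copy the frame into a table

# Join the table to itself on exam and exclude self-pairs in the same step.
sql = """
SELECT a.student AS student, b.student AS student_shared_with
FROM exams AS a
JOIN exams AS b
  ON a.exam = b.exam AND a.student <> b.student
"""
sql_pairs = pd.read_sql_query(sql, conn)
conn.close()
print(sql_pairs)
```

The join condition does the work of both pandas steps: matching on `exam` corresponds to the merge, and `a.student <> b.student` corresponds to the filter.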
