
How to count how many times an entity appears with another entity

I have the following dataframe:

import pandas as pd
df = pd.DataFrame([[1, 2], [1, 3], [4, 6], [4, 7]], columns=['group_id', 'student_id'])

Each student_id can appear multiple times, in different group_id values, together with other student_id values.

I want to count how many times student x was in the same group as student y. In other words, I want an n x n DataFrame where each entry is the number of times two students have been in the same group (same group_id); where there is no match, fill with 0.

   2  3  4  5  6  7
3  1  0  0  0  0  0
4  0  0  0  0  0  0
5  0  0  0  0  0  0
6  0  0  0  0  0  1
7  0  0  0  0  1  0

Is there a clever way to do this with SQL or pandas?

Thanks

You can do this with NumPy's outer comparison:

import numpy as np
s = df.group_id.to_numpy()
yourdf = pd.DataFrame(np.equal.outer(s,s),index=df.student_id,columns=df.student_id).astype(int)
yourdf
Out[40]: 
student_id  2  3  6  7
student_id            
2           1  1  0  0
3           1  1  0  0
6           0  0  1  1
7           0  0  1  1

Or build a group-by-student table with crosstab and take a dot product:

freq = pd.crosstab(df['group_id'],df['student_id'])
yourdf = freq.T.dot(freq)
Out[45]: 
student_id  2  3  6  7
student_id            
2           1  1  0  0
3           1  1  0  0
6           0  0  1  1
7           0  0  1  1
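
Why the dot product counts shared groups: freq has one row per group and one column per student, so entry (i, j) of freq.T.dot(freq) sums, over all groups, the product of the two students' membership indicators. Below is a minimal sketch of that idea. The data is made up for illustration and is not the frame from the question, and the name co is just a placeholder.

```python
import pandas as pd

# Hypothetical data: students 2 and 3 share groups 1 and 3,
# students 2 and 6 share group 2.
df = pd.DataFrame(
    [[1, 2], [1, 3], [2, 2], [2, 6], [3, 3], [3, 2]],
    columns=['group_id', 'student_id'],
)

# Rows are groups, columns are students, cells are membership counts.
freq = pd.crosstab(df['group_id'], df['student_id'])

# Entry (i, j) is the number of groups containing both student i and student j.
co = freq.T.dot(freq)
print(co)
```

Note that the diagonal of co holds the number of groups each student belongs to, which is why the outputs above show 1 on the diagonal.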

You can merge the DataFrame with itself on group_id and then use pivot_table:

df_ = (df.merge(df, on='group_id')
         .pivot_table(index='student_id_x', columns='student_id_y', 
                      values='group_id', aggfunc='nunique').fillna(0)
         .astype(int)
      )
print(df_)
student_id_y  2  3  6  7
student_id_x            
2             1  1  0  0
3             1  1  0  0
6             0  0  1  1
7             0  0  1  1

You can do:

# make dummy cols in the dataframe
df['student_id_2'] = df['student_id'].copy()
df['flag'] = 1

dx = (df
      .drop(columns='group_id')
      .set_index(['student_id', 'student_id_2'])
      .unstack(-1)
      .fillna(0))

# fix column names
dx.columns.names = None, None
dx.columns = [x[1] for x in dx.columns]

print(dx)
                            
                2    3    6    7
student_id                      
2             1.0  0.0  0.0  0.0
3             0.0  1.0  0.0  0.0
6             0.0  0.0  1.0  0.0
7             0.0  0.0  0.0  1.0

To present a more instructive example (with more non-zero entries), I prepared a slightly larger source DataFrame:

   group_id  student_id
0         1           2
1         1           3
2         2           2
3         2           6
4         3           3
5         3           2
6         4           6
7         4           7

To get the result, run:

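stId = df.student_id.unique()
# Start with an all-zero student x student table
result = pd.DataFrame(0, index=stId, columns=stId)
# Note: the unpacking below assumes every group holds exactly two students
for s1, s2 in df.groupby('group_id').student_id.apply(list):
    result.loc[s2, s1] += 1
    result.loc[s1, s2] += 1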

When you print the result, you will get:

   2  3  6  7
2  0  2  1  0
3  2  0  0  0
6  1  0  0  1
7  0  0  1  0

As you can see:

  • students 2 and 3 were in the same group twice,
  • student 6 was paired with student 2 once,
  • student 6 was paired with student 7 once.
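
The loop above unpacks each group into exactly two students, so it would fail for groups of any other size. One possible generalization, sketched here with itertools.combinations, counts every pair inside each group. The sample data (including a three-student group) is an assumption made for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical data: group 1 has three students, groups 2 and 3 have two.
df = pd.DataFrame(
    [[1, 2], [1, 3], [1, 6], [2, 2], [2, 6], [3, 6], [3, 7]],
    columns=['group_id', 'student_id'],
)

students = sorted(df['student_id'].unique())
result = pd.DataFrame(0, index=students, columns=students)

for _, members in df.groupby('group_id')['student_id']:
    # set() guards against a student listed twice in the same group
    for a, b in combinations(sorted(set(members)), 2):
        result.loc[a, b] += 1
        result.loc[b, a] += 1

print(result)
```

Because combinations never pairs an element with itself, the diagonal stays at 0.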

In my opinion, any solution that reports a student as sharing a group with himself (a non-zero diagonal) is wrong.
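
If the crosstab approach is preferred, one way to address this is to zero out the diagonal afterwards. The following is only a sketch using the frame from the question; the names co, values and pairs are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6], [4, 7]],
                  columns=['group_id', 'student_id'])

freq = pd.crosstab(df['group_id'], df['student_id'])
co = freq.T.dot(freq)

# Work on a plain array copy, zero the diagonal, and rebuild the frame.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
pairs = pd.DataFrame(values, index=co.index, columns=co.columns)
print(pairs)
```

Unlike the pairwise loop, this works for groups of any size, since the dot product already counts every pair of students within a group.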

