
Populate one dataframe based on group values from another

I have one DataFrame, data:

   groupId service local
0        1      s1    l1
1        1      s1    l1
2        1      s2    l2
3        1      s3    l3
4        2      s2    l2
5        2      s3    l3
6        3      s1    l1
7        3      s2    l2

and a second DataFrame, question:

   q1  q2  howManyGroups
0  s1  l1              0
1  s1  s2              0
2  s2  l2              0
3  s3  l3              0
4  s3  l1              0

For each row of question, I want to count how many groups in data contain all of that row's values (in either the service or the local column):

   q1  q2  howManyGroups
0  s1  l1              2
1  s1  s2              2
2  s2  l2              3
3  s3  l3              2
4  s3  l1              1

I am using this code, but it is really slow:

for i,g in data.groupby('groupId'):
  for j,r in question.iterrows():
    if set(r[['q1','q2']].values).issubset(set( g.drop('groupId', axis=1).values.ravel())):
      question.loc[j,'howManyGroups'] += 1

Edit: My question DataFrame can sometimes have more or fewer columns than q1 and q2. Sometimes it has only q1, sometimes it has q1, q2, q3, ...

You can first reshape data so that there is one row per groupId and per unique value found in either the service or the local column:

data_ = (data.set_index('groupId').stack()
             .reset_index(name='h')
             [['groupId', 'h']].drop_duplicates()
        )
print (data_.head())
   groupId   h
0        1  s1
1        1  l1
4        1  s2
5        1  l2
6        1  s3
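
As a rough sketch of the same reshaping step, assuming the sample data from the question, melt can stand in for set_index/stack. The name data_long is hypothetical, and this is only an alternative way to get the same (groupId, h) pairs; the row order differs from the stack version because melt lists all service values before the local ones.

```python
import pandas as pd

# Sample frame copied from the question
data = pd.DataFrame({
    "groupId": [1, 1, 1, 1, 2, 2, 3, 3],
    "service": ["s1", "s1", "s2", "s3", "s2", "s3", "s1", "s2"],
    "local":   ["l1", "l1", "l2", "l3", "l2", "l3", "l1", "l2"],
})

# melt stacks the service and local values into a single column 'h',
# keeping groupId next to each value; duplicates within a group are dropped
data_long = (
    data.melt(id_vars="groupId", value_name="h")[["groupId", "h"]]
        .drop_duplicates()
        .reset_index(drop=True)
)

print(data_long)
```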

Then merge question with data_ twice. The first merge, on q1 only (against h in data_), finds the groupIds that contain the q1 value. The second merge, on q2 and groupId, keeps only the groups that also contain the q2 value, which ensures that q1 and q2 appear in the same group. Finally, group by the original index (preserved by calling reset_index before the merges) and apply nunique to groupId:

question['howManyGroups'] = (question[['q1','q2']].reset_index()
                                .merge(data_, left_on=['q1'], right_on=['h'])
                                .merge(data_, left_on=['q2','groupId'], 
                                              right_on=['h','groupId'])
                                .groupby('index')['groupId'].nunique()
                            )
print (question)
   q1  q2  howManyGroups
0  s1  l1              2
1  s1  s2              2
2  s2  l2              3
3  s3  l3              2
4  s3  l1              1
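
One thing to watch with this pattern: both merges are inner joins, so a question row that matches no group drops out and ends up as NaN after the assignment. The sketch below, which assumes the sample data from the question plus a made-up value 's9' that appears in no group, fills those rows with 0 by reindexing the counts; the name counts is hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "groupId": [1, 1, 1, 1, 2, 2, 3, 3],
    "service": ["s1", "s1", "s2", "s3", "s2", "s3", "s1", "s2"],
    "local":   ["l1", "l1", "l2", "l3", "l2", "l3", "l1", "l2"],
})
# 's9' is an invented value that is absent from data, used to show the zero case
question = pd.DataFrame({"q1": ["s1", "s3", "s9"], "q2": ["l1", "l1", "l1"]})

# One row per (groupId, value), as in the reshaping step
data_ = (data.set_index("groupId").stack()
             .reset_index(name="h")[["groupId", "h"]]
             .drop_duplicates())

counts = (question[["q1", "q2"]].reset_index()
          .merge(data_, left_on="q1", right_on="h")
          .merge(data_, left_on=["q2", "groupId"], right_on=["h", "groupId"])
          .groupby("index")["groupId"].nunique())

# Rows with no matching group are missing from counts; reindex restores them as 0
question["howManyGroups"] = counts.reindex(question.index, fill_value=0)
print(question)
```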

If you have an unknown number of q columns, you could try something like:

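# all columns named q1, q2, q3, ... (raw string so \d is not treated as an escape)
l_q = question.filter(regex=r'^q\d+$').columns.tolist()

# first merge: every groupId that contains the value of the first q column
df_tmp = (question.reset_index()
                  .merge(data_, left_on=[l_q[0]], right_on=['h'])
                  .drop(columns='h')
         )

# each further merge keeps only the groupIds that also contain that column's value;
# dropping 'h' each time avoids clashing h_x/h_y suffixes when there are many q columns
for q in l_q[1:]:
    df_tmp = (df_tmp.merge(data_, left_on=[q,'groupId'], right_on=['h', 'groupId'])
                    .drop(columns='h'))

question['howManyGroups'] = df_tmp.groupby('index')['groupId'].nunique()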
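
For reuse, the loop can be wrapped in a function. The sketch below is one possible packaging, not part of the original answer: the name count_groups is hypothetical, it assumes the q columns are named q1, q2, ... and that question has no column called index, and it adds two choices of its own (dropping the helper column h after each merge, and returning 0 for rows that match no group).

```python
import re
import pandas as pd

def count_groups(data, question):
    """Count, per question row, the groups in data containing all q values."""
    # One row per (groupId, value) across the service and local columns
    data_ = (data.set_index("groupId").stack()
                 .reset_index(name="h")[["groupId", "h"]]
                 .drop_duplicates())
    q_cols = [c for c in question.columns if re.fullmatch(r"q\d+", c)]

    # Keep the original row labels in a column called 'index'
    tmp = question[q_cols].rename_axis("index").reset_index()

    # First column: find every group that contains its value
    tmp = tmp.merge(data_, left_on=q_cols[0], right_on="h").drop(columns="h")
    # Remaining columns: keep only groups that also contain each value
    for q in q_cols[1:]:
        tmp = (tmp.merge(data_, left_on=[q, "groupId"], right_on=["h", "groupId"])
                  .drop(columns="h"))

    counts = tmp.groupby("index")["groupId"].nunique()
    # Rows that matched no group are absent from counts; report them as 0
    return counts.reindex(question.index, fill_value=0)

data = pd.DataFrame({
    "groupId": [1, 1, 1, 1, 2, 2, 3, 3],
    "service": ["s1", "s1", "s2", "s3", "s2", "s3", "s1", "s2"],
    "local":   ["l1", "l1", "l2", "l3", "l2", "l3", "l1", "l2"],
})

# Works with one, two, or three q columns
one = pd.DataFrame({"q1": ["s1", "s2", "s3"]})
three = pd.DataFrame({"q1": ["s1", "s3", "s2"],
                      "q2": ["l1", "l3", "l2"],
                      "q3": ["s2", "s1", "s3"]})
one["howManyGroups"] = count_groups(data, one)
three["howManyGroups"] = count_groups(data, three)
print(one)
print(three)
```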
