Populate one dataframe based on group values from another

Question

I have one DataFrame, data:

   groupId service local
0        1      s1    l1
1        1      s1    l1
2        1      s2    l2
3        1      s3    l3
4        2      s2    l2
5        2      s3    l3
6        3      s1    l1
7        3      s2    l2

and I have another DataFrame, question:

   q1  q2  howManyGroups
0  s1  l1              0
1  s1  s2              0
2  s2  l2              0
3  s3  l3              0
4  s3  l1              0

For each row of question, I want to count how many groups in data contain all of that row's values:

   q1  q2  howManyGroups
0  s1  l1              2
1  s1  s2              2
2  s2  l2              3
3  s3  l3              2
4  s3  l1              1

I am using this code, but it is really slow:

for i,g in data.groupby('groupId'):
  for j,r in question.iterrows():
    if set(r[['q1','q2']].values).issubset(set( g.drop('groupId', axis=1).values.ravel())):
      question.loc[j,'howManyGroups'] += 1

Edit: My question dataframe can sometimes have more or fewer columns than q1 and q2. Sometimes it has only q1, sometimes it has q1, q2, q3, ...

Answer 1

First, reshape data so that you get one row per groupId and unique value, taken from either the service or the local column.

data_ = (data.set_index('groupId').stack()
             .reset_index(name='h')
             [['groupId', 'h']].drop_duplicates()
        )
print (data_.head())
   groupId   h
0        1  s1
1        1  l1
4        1  s2
5        1  l2
6        1  s3

Then merge question with data_ twice. The first merge is only on q1 (against h in data_) and gives the groupId values associated with q1. The second merge is on q2 and groupId, which ensures that q1 and q2 are both present in the same group. Finally, group by the original index (kept by calling reset_index before the merges) and use nunique on groupId:

question['howManyGroups'] = (question[['q1','q2']].reset_index()
                                .merge(data_, left_on=['q1'], right_on=['h'])
                                .merge(data_, left_on=['q2','groupId'], 
                                              right_on=['h','groupId'])
                                .groupby('index')['groupId'].nunique()
                            )
print (question)
   q1  q2  howManyGroups
0  s1  l1              2
1  s1  s2              2
2  s2  l2              3
3  s3  l3              2
4  s3  l1              1

If you have an unknown number of q columns (q1, q2, q3, ...), you could try something like:

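df_tmp = (question.reset_index()
                  .merge(data_, left_on=['q1'], right_on=['h'])
                  .drop(columns='h')
         )

# all question columns named q<number>, except q1 which is already merged
l_q = question.filter(regex=r'^q\d+$').columns.tolist()
l_q.remove('q1')

for q in l_q:
    # keep only the groups that also contain the value in column q;
    # dropping h avoids clashing h_x/h_y suffixes on repeated merges
    df_tmp = (df_tmp.merge(data_, left_on=[q,'groupId'], right_on=['h', 'groupId'])
                    .drop(columns='h'))

question['howManyGroups'] = df_tmp.groupby('index')['groupId'].nunique()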
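
The steps above can be collected into one self-contained function. This is only a sketch under a few assumptions: the function name count_groups, the group_col parameter and the temporary column names h and row are made up for illustration, and the question columns are assumed to be named q1, q2, and so on. Unlike the snippets above, it fills in 0 for question rows that match no group, which would otherwise come out as NaN.

```python
import pandas as pd


def count_groups(data, question, group_col="groupId"):
    """For each row of `question`, count the groups of `data` that
    contain every value in that row's q-columns (hypothetical helper)."""
    # One row per (group, distinct value), taken from all non-group columns.
    long = (data.set_index(group_col).stack()
                .reset_index(name="h")[[group_col, "h"]]
                .drop_duplicates())

    # Assumption: question columns are named q1, q2, q3, ...
    q_cols = question.filter(regex=r"^q\d+$").columns.tolist()

    # The first merge finds the candidate groups for each question row.
    tmp = (question[q_cols].rename_axis("row").reset_index()
           .merge(long, left_on=q_cols[0], right_on="h")
           .drop(columns="h"))

    # Each further merge keeps only groups that also contain the next value.
    for q in q_cols[1:]:
        tmp = (tmp.merge(long, left_on=[q, group_col],
                         right_on=["h", group_col])
                  .drop(columns="h"))

    counts = tmp.groupby("row")[group_col].nunique()
    # Question rows with no matching group get 0 instead of NaN.
    return counts.reindex(question.index, fill_value=0)


data = pd.DataFrame({
    "groupId": [1, 1, 1, 1, 2, 2, 3, 3],
    "service": ["s1", "s1", "s2", "s3", "s2", "s3", "s1", "s2"],
    "local":   ["l1", "l1", "l2", "l3", "l2", "l3", "l1", "l2"],
})
question = pd.DataFrame({
    "q1": ["s1", "s1", "s2", "s3", "s3"],
    "q2": ["l1", "s2", "l2", "l3", "l1"],
})
question["howManyGroups"] = count_groups(data, question)
print(question)
```

Because the q-columns are picked up by name, the same call should also work when question has only q1 or has additional columns such as q3.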
