Pandas: create a column in one dataframe based on matching values in multiple columns of a different dataframe

Question

I have data on course attendance and my goal is to get counts of attendees for each course. Unfortunately, the person who led the course is also in the data and needs to be removed. I can't just remove all rows with that person's name because if they attended a course led by someone else, they should count as an attendee.

I have two dataframes:

new_data:

|name | email | file | course | date   |
|-----|-------|------|--------|--------|
|jo   |j@c.i  |one   |A       |6/10/20 |
|bo   |b@c.i  |one   |A       |6/10/20 |
|bo   |b@c.i  |one   |B       |6/11/20 |
|mo   |m@c.i  |one   |B       |6/11/20 |

map_data:

|lead | course | date   |
|-----|--------|--------|
|jo   |A       |6/10/20 |
|bo   |B       |6/11/20 |
|mo   |B       |6/11/20 |

I need to create a new column in new_data to flag whether someone was a lead. There is a lookup table map_data that indicates who led each session.

This is what the output should look like:

|name | email | file | course | date   | lead |
|-----|-------|------|--------|--------|------|
|jo   |j@c.i  |one   |A       |6/10/20 |1     |
|bo   |b@c.i  |one   |A       |6/10/20 |0     |
|bo   |b@c.i  |one   |B       |6/11/20 |1     |
|mo   |m@c.i  |one   |B       |6/11/20 |1     |

Notice that bo is not a lead in course A, but is in B.

Edit: some courses have multiple leads (B has two). This caused duplicated rows in some of my attempts to apply the solutions suggested in this thread.

This is a limited example, but different people run the same course on different days. jo might run course A on a different date.

For each row in new_data, I need to set new_data["lead"] to 1 if its name, course, and date match the lead, course, and date of a row in map_data. In all other cases, new_data["lead"] should be 0.

I am stuck because I don't know how to do the lookup between dataframes using three columns.

Answer 1

Would something like this work?

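keys = ["name", "course", "date"]
# Rename lead -> name so the key columns line up; drop_duplicates guards against repeated map rows
leads = map_data.rename(columns={"lead": "name"}).drop_duplicates().assign(is_lead=1)
tmp = new_data.join(leads.set_index(keys), on=keys)

# Rows with no match in the lookup come back as NaN, so treat them as 0
tmp["is_lead"] = tmp["is_lead"].fillna(0).astype("int")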
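A related way to get the same flag without touching the index is a left merge with indicator=True, which records whether each row found a partner in the lookup table. The sketch below rebuilds the sample frames from the question; the names keys, lookup, merged and result are illustrative only. Dropping duplicate lookup rows first is an assumption meant to keep the row count of new_data unchanged even if map_data repeats a (lead, course, date) combination.

```python
import pandas as pd

# Sample data copied from the question
new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

keys = ["name", "course", "date"]

# Align the lookup's column names with new_data and remove repeated keys
lookup = map_data.rename(columns={"lead": "name"}).drop_duplicates()

# indicator=True adds a "_merge" column: "both" means the key was found in lookup
merged = new_data.merge(lookup, on=keys, how="left", indicator=True)
merged["lead"] = (merged["_merge"] == "both").astype(int)
result = merged.drop(columns="_merge")

print(result)
```

Because every key in lookup is unique, a course with two leads (B here) flags both people without duplicating any attendee rows.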

Answer 2

Here is a function that might help:

def lead(df, df_map):
    # Combine each lead's name, course and date into a single string key, e.g. 'joA6/10/20'
    leads = [str(df_map["lead"][j]) + str(df_map["course"][j]) + str(df_map["date"][j]) for j in range(df_map.shape[0])]
    # Build the values for the lead column: 1 if the row's key is among the lead keys, else 0
    lead_col = [1 if str(df["name"][i]) + str(df["course"][i]) + str(df["date"][i]) in leads else 0 for i in range(df.shape[0])]
    # Insert the lead column into df and return it
    df['lead'] = lead_col
    return df

My input example:

name    email   file    course  date
jo      j@c.i   one     A       6/10/20
bo      b@c.i   one     B       6/11/20
bo      b@c.i   one     B       6/10/20
mo      mo@i    one     B       6/10/20
jay     j@i     one     B       6/11/20

Map:

lead    course  date
jo      A       6/10/20
bo      B       6/11/20
mo      B       6/10/20

Output:

name    email   file    course  date      lead
jo      j@c.i   one     A       6/10/20     1
bo      b@c.i   one     B       6/11/20     1
bo      b@c.i   one     B       6/10/20     0
mo      mo@i    one     B       6/10/20     1
jay     j@i     one     B       6/11/20     0
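A variant of the same idea, offered as a sketch: compare (name, course, date) tuples instead of concatenated strings. Tuples avoid accidental collisions (for example "ab" + "c" and "a" + "bc" produce the same string), and MultiIndex.isin performs the membership test without a Python-level loop over rows. The function name flag_leads is hypothetical; the data is the input example above.

```python
import pandas as pd

# Input example from this answer
df = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo", "jay"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "mo@i", "j@i"],
    "file": ["one"] * 5,
    "course": ["A", "B", "B", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20", "6/10/20", "6/11/20"],
})
df_map = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20"],
})


def flag_leads(df, df_map):
    """Return a copy of df with a 0/1 'lead' column (hypothetical helper)."""
    # One (lead, course, date) tuple per row of the lookup table
    lead_keys = list(zip(df_map["lead"], df_map["course"], df_map["date"]))
    # One (name, course, date) key per row of the attendance data
    row_keys = pd.MultiIndex.from_frame(df[["name", "course", "date"]])
    out = df.copy()
    # isin returns a boolean array: True where the row's key is a lead key
    out["lead"] = row_keys.isin(lead_keys).astype(int)
    return out


flagged = flag_leads(df, df_map)
print(flagged)
```

Unlike the function above, this version returns a copy and leaves the original dataframe unmodified; that is a design choice of the sketch, not a requirement.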

Answer 3

Use pd.crosstab(), which tabulates how often each person led each course on each date. Stack the result and rename the columns appropriately. This gives a new dataframe, which you combine with new_data using .combine_first(). That also appends the extra rows arising from the crosstab, so drop any rows containing NaNs.

Note that df = map_data here:

Chained solution

new_data.combine_first(pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})).dropna()

Step by step solution

# Crosstab
df3 = pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})
# Combine_first
res = new_data.combine_first(df3).dropna()
print(res)

 course     date  email file  lead name
0      A  6/10/20  j@c.i  one   0.0   jo
1      A  6/10/20  b@c.i  one   1.0   bo
2      B  6/11/20  b@c.i  one   1.0   bo
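One caveat: combine_first aligns the two frames on their row index, not on name, course, and date, so the flags in the printed result follow row position (jo is flagged 0.0 for course A although map_data lists jo as its lead). A possible key-based variant keeps the crosstab but attaches the counts with a left merge on the three key columns. This is a sketch using the question's sample data; the column name n_led and the variable names counts and res are assumptions introduced for illustration.

```python
import pandas as pd

# Sample data copied from the question
new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# Count how often each (lead, course) pair appears on each date,
# then reshape to one row per (name, course, date) combination
counts = (
    pd.crosstab([map_data["lead"], map_data["course"]], map_data["date"])
    .stack()
    .rename("n_led")
    .reset_index()
    .rename(columns={"lead": "name"})
)

# Attach the counts by key rather than by row position
res = new_data.merge(counts, on=["name", "course", "date"], how="left")

# Keys absent from the crosstab come back as NaN; treat them as "not a lead"
res["lead"] = (res["n_led"].fillna(0) > 0).astype(int)
res = res.drop(columns="n_led")

print(res)
```

Each (name, course, date) key appears at most once in counts, so the merge cannot duplicate attendee rows even when a course has several leads.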
