I have data on course attendance and my goal is to get counts of attendees for each course. Unfortunately, the person who led the course is also in the data and needs to be removed. I can't just remove all rows with that person's name because if they attended a course led by someone else, they should count as an attendee.
I have two dataframes:
new_data:
|name | email | file | course | date |
|-----|-------|------|--------|--------|
|jo |j@c.i |one |A |6/10/20 |
|bo |b@c.i |one |A |6/10/20 |
|bo |b@c.i |one |B |6/11/20 |
|mo |m@c.i |one |B |6/11/20 |
map_data:
|lead | course | date |
|-----|--------|--------|
|jo |A |6/10/20 |
|bo |B |6/11/20 |
|mo |B |6/11/20 |
I need to create a new column in new_data to flag whether someone was a lead. There is a lookup table, map_data, that indicates who led each session.
This is what the output should look like:
|name | email | file | course | date | lead |
|-----|-------|------|--------|--------|------|
|jo |j@c.i |one |A |6/10/20 |1 |
|bo |b@c.i |one |A |6/10/20 |0 |
|bo |b@c.i |one |B |6/11/20 |1 |
|mo |m@c.i |one |B |6/11/20 |1 |
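Once a lead column like this exists, the attendee counts that are the actual goal follow from a group-by. A minimal sketch, assuming the flagged frame matches the table above; the names flagged and attendee_counts are made up for illustration:

```python
import pandas as pd

# Flagged frame, copied from the expected output table
flagged = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
    "lead": [1, 0, 1, 1],
})

# Count the non-lead rows per course; courses with only leads still appear, with 0
attendee_counts = (flagged["lead"] == 0).groupby(flagged["course"]).sum()
print(attendee_counts)
```

Summing a boolean mask, rather than filtering first, keeps courses that have no attendees at all in the result.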
Notice that bo is not a lead in course A, but is in B.
Edit: some courses have multiple leads: B has two. This has led to duplication issues in some of my attempts to solve this problem using the suggested solutions in this thread.
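The duplication can be reproduced with the sample data. The sketch below assumes the two frames shown earlier and contrasts a merge keyed on the session only with one keyed on the person as well; by_session, by_person, and keys are illustrative names:

```python
import pandas as pd

new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# Keyed on course and date only: each row is paired with every lead of that session
by_session = new_data.merge(map_data, on=["course", "date"], how="left")
print(len(new_data), len(by_session))  # 4 rows become 6

# Keyed on the person too: at most one match per row, provided map_data
# holds no repeated (lead, course, date) rows, hence the drop_duplicates
keys = map_data.drop_duplicates().rename(columns={"lead": "name"})
by_person = new_data.merge(keys, on=["name", "course", "date"],
                           how="left", indicator=True)
by_person["lead"] = (by_person["_merge"] == "both").astype(int)
by_person = by_person.drop(columns="_merge")
print(by_person)
```

The extra rows come from course B having two leads, so any join that ignores the name multiplies the B rows.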
This is a limited example, but different people run the same course on different days. jo might run course A on a different date.
For each row in new_data, I need to mark new_data["lead"] as 1 if the name, course, and date match the values in map_data. In all other cases, new_data["lead"] should be 0.
I am stuck because I don't know how to do the lookup between dataframes using three columns.
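One way a three-column lookup of this kind can be expressed is to treat each (name, course, date) combination as a single composite key. A minimal sketch, assuming the sample frames above and using pandas MultiIndex objects; row_keys and lead_keys are illustrative names:

```python
import pandas as pd

new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# One composite key per row, built from the three columns of each frame
row_keys = pd.MultiIndex.from_frame(new_data[["name", "course", "date"]])
lead_keys = pd.MultiIndex.from_frame(map_data[["lead", "course", "date"]])

# isin compares whole tuples, so the number of rows in new_data never changes
new_data["lead"] = row_keys.isin(lead_keys).astype(int)
print(new_data)
```

Because this is a membership test rather than a join, courses with several leads cannot duplicate rows.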
Would something like this work?
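# Left-merge on all three keys so every row of new_data is kept exactly once
tmp = new_data.merge(map_data.drop_duplicates(), how="left", left_on=["name", "course", "date"], right_on=["lead", "course", "date"])
# Unmatched rows have NaN in "lead", so the comparison is False for them
tmp["is_lead"] = (tmp["name"] == tmp["lead"]).astype(int)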
Here is a function that might help:
def lead(df, df_map):
    # Join each lead's name, course, and date into one string key, e.g. 'joA6/10/20'
    leads = [str(df_map.lead[j]) + str(df_map.course[j]) + str(df_map.date[j]) for j in range(df_map.shape[0])]
    # Build the lead column: 1 if the row's key is among the lead keys, else 0
    lead_col = [1 if str(df.name[i]) + str(df.course[i]) + str(df.date[i]) in leads else 0 for i in range(df.shape[0])]
    # Insert the lead column into df and return it
    df['lead'] = lead_col
    return df
My input example:
name email file course date
jo j@c.i one A 6/10/20
bo b@c.i one B 6/11/20
bo b@c.i one B 6/10/20
mo mo@i one B 6/10/20
jay j@i one B 6/11/20
Map:
lead course date
jo A 6/10/20
bo B 6/11/20
mo B 6/10/20
Output:
name email file course date lead
jo j@c.i one A 6/10/20 1
bo b@c.i one B 6/11/20 1
bo b@c.i one B 6/10/20 0
mo mo@i one B 6/10/20 1
jay j@i one B 6/11/20 0
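A possible variant of the same idea keys on tuples instead of concatenated strings. This is a sketch under assumptions rather than the function above: flag_leads is a hypothetical name, and the data is the input example from this answer.

```python
import pandas as pd

def flag_leads(df, df_map):
    """Return a copy of df with a 0/1 'lead' column (hypothetical helper)."""
    # A set of (lead, course, date) tuples gives fast membership tests and cannot
    # mix up keys such as ('jo', 'A1', '1') and ('jo', 'A', '11'), which
    # concatenate to the same string
    leads = set(zip(df_map["lead"], df_map["course"], df_map["date"]))
    out = df.copy()
    out["lead"] = [int(key in leads)
                   for key in zip(df["name"], df["course"], df["date"])]
    return out

df = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo", "jay"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "mo@i", "j@i"],
    "file": ["one"] * 5,
    "course": ["A", "B", "B", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20", "6/10/20", "6/11/20"],
})
df_map = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20"],
})

result = flag_leads(df, df_map)
print(result)
```

Zipping the columns also avoids label-based indexing, so it does not depend on the frames having a default 0..n-1 index.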
Use pd.crosstab() to tabulate how often each person led each course on each date, then stack the result and rename the columns so they match new_data. This gives a new dataframe that you combine with new_data using .combine_first(), which brings in all the rows arising from the crosstab. Finally, drop any NaNs.
Please note that df = map_data here.
Chained solution:
new_data.combine_first(pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})).dropna()
Step by step solution
#Crosstab
df3=pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})
#Combine_first
res=new_data.combine_first(df3).dropna()
print(res)
course date email file lead name
0 A 6/10/20 j@c.i one 0.0 jo
1 A 6/10/20 b@c.i one 1.0 bo
2 B 6/11/20 b@c.i one 1.0 bo
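Note that combine_first aligns the two frames on their index labels, not on name, course, and date, so its result depends on row order. A hedged alternative keeps the crosstab step but attaches the counts with a key-based left merge. This is a swapped-in technique, not the method above, and n_lead is an illustrative column name:

```python
import pandas as pd

new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
df = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# Long table with one row per (lead, course, date) and how often that person led
counts = (
    pd.crosstab([df["lead"], df["course"]], df["date"])
    .stack()
    .rename("n_lead")
    .reset_index()
    .rename(columns={"lead": "name"})
)

# Attach the counts by key; rows with no match or a count of 0 are not leads
res = new_data.merge(counts, on=["name", "course", "date"], how="left")
res["lead"] = res["n_lead"].fillna(0).gt(0).astype(int)
res = res.drop(columns="n_lead")
print(res)
```

Merging on the three key columns keeps every row of new_data and does not rely on the two frames sharing an index.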