Pandas: create a column in one dataframe based on matching values in multiple columns of a different dataframe
I have data on course attendance and my goal is to get counts of attendees for each course. Unfortunately, the person who led the course is also in the data and needs to be removed. I can't just remove all rows with that person's name because if they attended a course led by someone else, they should count as an attendee.
I have two dataframes:
new_data:
|name | email | file | course | date |
|-----|-------|------|--------|--------|
|jo |j@c.i |one |A |6/10/20 |
|bo |b@c.i |one |A |6/10/20 |
|bo |b@c.i |one |B |6/11/20 |
|mo |m@c.i |one |B |6/11/20 |
map_data:
|lead | course | date |
|-----|--------|--------|
|jo |A |6/10/20 |
|bo |B |6/11/20 |
|mo |B |6/11/20 |
I need to create a new column in new_data to flag whether someone was a lead. There is a lookup table, map_data, that indicates who led each session.
This is what the output should look like:
|name | email | file | course | date | lead |
|-----|-------|------|--------|--------|------|
|jo |j@c.i |one |A |6/10/20 |1 |
|bo |b@c.i |one |A |6/10/20 |0 |
|bo |b@c.i |one |B |6/11/20 |1 |
|mo |m@c.i |one |B |6/11/20 |1 |
Notice that bo is not a lead in course A, but is in B.
Edit: some courses have multiple leads: B has two. This has led to duplication issues in some of my attempts to solve this problem using the suggested solutions in this thread.
This is a limited example, but different people run the same course on different days. jo might run course A on a different date.
For each row in new_data, I need to mark new_data["lead"] as 1 if the name, course, and date match the values in map_data. In all other cases, new_data["lead"] should be 0.
I am stuck because I don't know how to do the lookup between dataframes using three columns.
Would something like this work?
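# Rename "lead" to "name" so the keys line up, drop repeated rows, and mark every remaining row as a lead
lead_keys = map_data.rename(columns={"lead": "name"}).drop_duplicates().assign(is_lead=1)
tmp = new_data.join(lead_keys.set_index(["name", "course", "date"]), on=["name", "course", "date"])
tmp["is_lead"] = tmp["is_lead"].fillna(0).astype('int')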
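Another option, given the duplication issues mentioned in the question's edit, is to skip the join and test membership on a composite key. The following is a minimal sketch rather than a definitive solution: it assumes the sample frames from the question, and the helper names attendee_keys, lead_keys and attendees are made up for illustration.

```python
import pandas as pd

# Sample frames copied from the question
new_data = pd.DataFrame({
    "name":   ["jo", "bo", "bo", "mo"],
    "email":  ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file":   ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date":   ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead":   ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date":   ["6/10/20", "6/11/20", "6/11/20"],
})

# Treat each (name, course, date) combination as one composite key
# (attendee_keys and lead_keys are illustrative names)
attendee_keys = pd.MultiIndex.from_frame(new_data[["name", "course", "date"]])
lead_keys = pd.MultiIndex.from_frame(map_data[["lead", "course", "date"]])

# Membership test only: nothing is joined, so new_data keeps its row count
new_data["lead"] = attendee_keys.isin(lead_keys).astype(int)

# Attendee counts per course with the leads excluded
attendees = new_data[new_data["lead"] == 0].groupby("course").size()
print(new_data)
print(attendees)
```

Because this only asks whether each row's key appears among the lead keys, a course with several leads (such as B) does not multiply rows.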
Here is a function that might help:
def lead(df, df_map):
    # Collect every (lead, course, date) combination as a tuple, e.g. ('jo', 'A', '6/10/20').
    # Tuples avoid the accidental collisions that concatenated strings can produce.
    leads = set(zip(df_map["lead"], df_map["course"], df_map["date"]))
    # Build the LEAD column: 1 if the row's (name, course, date) is a known lead combination, else 0
    lead_col = [int(key in leads) for key in zip(df["name"], df["course"], df["date"])]
    # Insert the LEAD column in the df and return
    df["lead"] = lead_col
    return df
My input example:
name email file course date
jo j@c.i one A 6/10/20
bo b@c.i one B 6/11/20
bo b@c.i one B 6/10/20
mo mo@i one B 6/10/20
jay j@i one B 6/11/20
Map:
lead course date
jo A 6/10/20
bo B 6/11/20
mo B 6/10/20
Output:
name email file course date lead
jo j@c.i one A 6/10/20 1
bo b@c.i one B 6/11/20 1
bo b@c.i one B 6/10/20 0
mo mo@i one B 6/10/20 1
jay j@i one B 6/11/20 0
Use pd.crosstab(), which tabulates how often each person led each course on each date. Stack the result and rename the columns appropriately. This gives a new dataframe, which you join to new_data using .combine_first(). This appends all the rows arising from the crosstab. Drop any NaNs.
Please note df=map_data:
Chained solution
new_data.combine_first(pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})).dropna()
Step by step solution
#Crosstab
df3=pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})
#Combine_first
res=new_data.combine_first(df3).dropna()
print(res)
course date email file lead name
0 A 6/10/20 j@c.i one 0.0 jo
1 A 6/10/20 b@c.i one 1.0 bo
2 B 6/11/20 b@c.i one 1.0 bo
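One caveat with this approach: combine_first aligns the two frames on row index and column labels, not on the name, course and date values, so the lead value that lands on a row need not belong to that person. In the printed output, jo shows 0.0 although jo led course A on 6/10/20. A minimal sketch that keeps the crosstab step but joins on the key columns with merge is given here, assuming the sample frames from the question; df3 and res follow the names used above, and the rest is illustrative.

```python
import pandas as pd

# Sample frames copied from the question (df plays the role of map_data)
new_data = pd.DataFrame({
    "name":   ["jo", "bo", "bo", "mo"],
    "email":  ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file":   ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date":   ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
df = pd.DataFrame({
    "lead":   ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date":   ["6/10/20", "6/11/20", "6/11/20"],
})

# Long table with one row per (name, course, date) and a count of how often that person led
df3 = (
    pd.crosstab([df["lead"], df["course"]], df["date"])
    .stack()
    .reset_index()
    .rename(columns={"lead": "name", 0: "lead"})
)

# Join on the key columns instead of the row index; a left merge keeps every attendance row
res = new_data.merge(df3, on=["name", "course", "date"], how="left")

# Combinations missing from the crosstab become NaN; treat them as "not a lead"
res["lead"] = res["lead"].fillna(0).gt(0).astype(int)
print(res)
```

Since the crosstab holds each (name, course, date) combination at most once, the left merge does not duplicate rows even when a course has several leads.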