Pandas: create a column in one dataframe based on matching values in multiple columns of a different dataframe

Question

I have data on course attendance and my goal is to get counts of attendees for each course. Unfortunately, the person who led the course is also in the data and needs to be removed. I can't just remove all rows with that person's name because if they attended a course led by someone else, they should count as an attendee.

I have two dataframes:

new_data:

|name | email | file | course | date   |
|-----|-------|------|--------|--------|
|jo   |j@c.i  |one   |A       |6/10/20 |
|bo   |b@c.i  |one   |A       |6/10/20 |
|bo   |b@c.i  |one   |B       |6/11/20 |
|mo   |m@c.i  |one   |B       |6/11/20 |

map_data:

|lead | course | date   |
|-----|--------|--------|
|jo   |A       |6/10/20 |
|bo   |B       |6/11/20 |
|mo   |B       |6/11/20 |

I need to create a new column in new_data to flag whether someone was a lead. There is a lookup table, map_data, that indicates who led each session.

This is what the output should look like:

|name | email | file | course | date   | lead |
|-----|-------|------|--------|--------|------|
|jo   |j@c.i  |one   |A       |6/10/20 |1     |
|bo   |b@c.i  |one   |A       |6/10/20 |0     |
|bo   |b@c.i  |one   |B       |6/11/20 |1     |
|mo   |m@c.i  |one   |B       |6/11/20 |1     |

Notice that bo is not a lead in course A, but is in B.

Edit: some courses have multiple leads: B has two. This has led to duplication issues in some of my attempts to solve this problem using the suggested solutions in this thread.

This is a limited example, but different people run the same course on different days. jo might run course A on a different date.

For each row in new_data, I need to mark new_data["lead"] as 1 if the name, course, and date match the values in map_data. In all other cases, new_data["lead"] should be 0.

I am stuck because I don't know how to do the lookup between dataframes using three columns.

Answer 1

Would something like this work?

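# Rename lead -> name so both frames share the three key columns; drop repeated
# lead rows so the join cannot duplicate attendee rows, and mark every lead row with 1.
leads = map_data.rename(columns={"lead": "name"}).drop_duplicates().assign(is_lead=1)
tmp = new_data.merge(leads, on=["name", "course", "date"], how="left")

# Rows without a matching lead entry come back as NaN; treat them as 0.
tmp["is_lead"] = tmp["is_lead"].fillna(0).astype('int')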
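A variant of the same join idea can be sketched with `DataFrame.merge` and its `indicator` flag, using the sample frames from the question. The names `keys`, `leads`, `flagged`, and `attendees` are made up for this illustration, and the final count is only one possible reading of "attendees per course" (rows not flagged as lead).

```python
import pandas as pd

new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

keys = ["name", "course", "date"]
# Rename lead -> name so both frames share the same key columns, and drop
# repeated lead rows so the merge cannot duplicate attendee rows.
leads = map_data.rename(columns={"lead": "name"}).drop_duplicates()

# indicator=True adds a "_merge" column; "both" means the row matched a lead entry.
# validate="many_to_one" raises if the lookup keys are not unique.
flagged = new_data.merge(leads, on=keys, how="left",
                         indicator=True, validate="many_to_one")
flagged["lead"] = (flagged["_merge"] == "both").astype(int)
flagged = flagged.drop(columns="_merge")

# Attendees per course: count the rows that are not flagged as lead.
attendees = flagged["lead"].eq(0).groupby(flagged["course"]).sum()

print(flagged)
print(attendees)
```

Because the lookup table is de-duplicated before the merge, a course with several leads (such as B) still yields exactly one output row per row of new_data.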

Answer 2

Here is a function that might help:

def lead(df, df_map):
    # Collect each lead's name, course and date as one tuple key, e.g. ('jo', 'A', '6/10/20').
    # Tuples avoid the ambiguity of concatenated strings and do not rely on a 0..n-1 index.
    leads = set(zip(df_map["lead"], df_map["course"], df_map["date"]))
    # Build the data for the lead column: 1 if the row's key is a known lead, else 0
    lead_col = [1 if key in leads else 0 for key in zip(df["name"], df["course"], df["date"])]
    # Insert the lead column into df and return it
    df["lead"] = lead_col
    return df

My input example:

name    email   file    course  date
jo      j@c.i   one     A       6/10/20
bo      b@c.i   one     B       6/11/20
bo      b@c.i   one     B       6/10/20
mo      mo@i    one     B       6/10/20
jay     j@i     one     B       6/11/20

Map:

lead    course  date
jo      A       6/10/20
bo      B       6/11/20
mo      B       6/10/20

Output:

name    email   file    course  date      lead
jo      j@c.i   one     A       6/10/20     1
bo      b@c.i   one     B       6/11/20     1
bo      b@c.i   one     B       6/10/20     0
mo      mo@i    one     B       6/10/20     1
jay     j@i     one     B       6/11/20     0
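The same membership test can probably be done without a Python-level loop. The following is a minimal sketch, assuming the input example above, that builds a `MultiIndex` from the three key columns of each frame and uses `MultiIndex.isin`; the names `row_keys` and `lead_keys` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo", "jay"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "mo@i", "j@i"],
    "file": ["one"] * 5,
    "course": ["A", "B", "B", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20", "6/10/20", "6/11/20"],
})
df_map = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20"],
})

# One (name, course, date) key per attendance row ...
row_keys = pd.MultiIndex.from_frame(df[["name", "course", "date"]])
# ... and one (lead, course, date) key per lookup row.
lead_keys = pd.MultiIndex.from_frame(df_map[["lead", "course", "date"]])

# isin returns a boolean array: True where the row's key appears among the lead keys.
df["lead"] = row_keys.isin(lead_keys).astype(int)

print(df)
```

Since this only tests membership, repeated or multiple leads in the lookup table cannot add rows to df.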

Answer 3

Use pd.crosstab(), which tabulates how often each person led each course on each date. Stack the result and rename the columns appropriately. This gives a new dataframe, which you combine with new_data using .combine_first(). This appends all the rows arising from the crosstab. Drop any NaNs.

Please note that df = map_data here:

Chained solution

new_data.combine_first(pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})).dropna()

Step by step solution

# Crosstab
df3 = pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})
# Combine_first
res = new_data.combine_first(df3).dropna()
print(res)



 course     date  email file  lead name
0      A  6/10/20  j@c.i  one   0.0   jo
1      A  6/10/20  b@c.i  one   1.0   bo
2      B  6/11/20  b@c.i  one   1.0   bo
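Note that `combine_first` aligns the two frames on their index and column labels, not on the name/course/date values, so the flags in the output above follow row position rather than a key match. A key-based variant of the crosstab idea might look like the sketch below, which uses the sample frames from the question; `counts`, `lead_count`, and `res` are names made up for this illustration.

```python
import pandas as pd

new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one"] * 4,
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# Count how often each (lead, course, date) combination occurs in the lookup table.
counts = (
    pd.crosstab([map_data["lead"], map_data["course"]], map_data["date"])
    .stack()                      # one row per (lead, course, date)
    .rename("lead_count")         # name the count column explicitly
    .reset_index()
    .rename(columns={"lead": "name"})
)

# Match on the key columns instead of on row position.
res = new_data.merge(counts, on=["name", "course", "date"], how="left")
# Combinations missing from the crosstab come back as NaN; treat them as 0.
res["lead"] = (res["lead_count"].fillna(0) > 0).astype(int)
res = res.drop(columns="lead_count")

print(res)
```

The crosstab yields one count per key combination, so the left merge keeps one output row per row of new_data even when a course has several leads.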

Answer 1
0 2020-07-01 19:59:20

Answer 2
0 2020-07-01 20:52:21

Answer 3
0 2020-07-01 21:34:43
