Pandas: create a column in one dataframe based on matching values in multiple columns of a different dataframe

I have data on course attendance, and my goal is to get a count of attendees for each course. Unfortunately, the person who led each course is also in the data and needs to be removed. I can't just remove all rows with that person's name because if they attended a course led by someone else, they should count as an attendee.

I have two dataframes:

new_data:

|name | email | file | course | date   |
|-----|-------|------|--------|--------|
|jo   |j@c.i  |one   |A       |6/10/20 |
|bo   |b@c.i  |one   |A       |6/10/20 |
|bo   |b@c.i  |one   |B       |6/11/20 |
|mo   |m@c.i  |one   |B       |6/11/20 |

map_data:

|lead | course | date   |
|-----|--------|--------|
|jo   |A       |6/10/20 |
|bo   |B       |6/11/20 |
|mo   |B       |6/11/20 |

I need to create a new column in new_data to flag whether someone was a lead. A lookup table, map_data, indicates who led each session.

This is what the output should look like:

|name | email | file | course | date   | lead |
|-----|-------|------|--------|--------|------|
|jo   |j@c.i  |one   |A       |6/10/20 |1     |
|bo   |b@c.i  |one   |A       |6/10/20 |0     |
|bo   |b@c.i  |one   |B       |6/11/20 |1     |
|mo   |m@c.i  |one   |B       |6/11/20 |1     |

Notice that bo is not a lead in course A, but is in course B.

Edit: some courses have multiple leads; B has two. This has led to duplication issues in some of my attempts to solve this problem using the solutions suggested in this thread.

This is a limited example; in the full data, different people run the same course on different days. For instance, jo might run course A on a different date.

For each row in new_data, I need to set new_data["lead"] to 1 if the name, course, and date match a row in map_data (where the lead column holds the name). In all other cases, new_data["lead"] should be 0.

I am stuck because I don't know how to do the lookup between dataframes using three columns.

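Would something like this work?

# Left-merge on all three keys so that unmatched rows get NaN in "lead"
tmp = new_data.merge(map_data.drop_duplicates(), how="left", left_on=["name", "course", "date"], right_on=["lead", "course", "date"])

# NaN never equals a name, so unmatched rows become False
tmp["is_lead"] = tmp["name"] == tmp["lead"]
tmp["is_lead"] = tmp["is_lead"].astype('int')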
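For what it's worth, one way to sidestep the duplication problem mentioned in the question's edit is a left merge with indicator=True. This is only a sketch that assumes the frames look like the ones in the question; the names flagged and attendee_counts are made up for illustration.

```python
import pandas as pd

# Sample frames copied from the question.
new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one", "one", "one", "one"],
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# Rename "lead" to "name" so all three keys line up, and drop duplicate
# lookup rows so the left merge cannot multiply rows of new_data.
lookup = map_data.rename(columns={"lead": "name"}).drop_duplicates()

# indicator=True adds a "_merge" column: "both" means the row was found
# in the lookup, "left_only" means it was not.
flagged = new_data.merge(
    lookup, on=["name", "course", "date"], how="left", indicator=True
)
flagged["lead"] = (flagged["_merge"] == "both").astype(int)
flagged = flagged.drop(columns="_merge")

# The stated goal: attendees per course, leaving out the lead rows.
attendee_counts = flagged[flagged["lead"] == 0].groupby("course").size()
print(flagged)
print(attendee_counts)
```

Because the merge uses name, course, and date together, a course with several leads only produces extra rows if map_data itself contains repeated rows, which is what drop_duplicates() guards against.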

Here is a function that might help:

def lead(df, df_map):
    # Collect each lead's name, course and date as a tuple key, e.g. ('jo', 'A', '6/10/20')
    leads = set(zip(df_map["lead"], df_map["course"], df_map["date"]))
    # Build the data for the lead column: 1 if the row's key is among the leads, else 0
    lead_col = [int(key in leads) for key in zip(df["name"], df["course"], df["date"])]
    # Insert the lead column into df and return it
    df["lead"] = lead_col
    return df

My input example:

name    email   file    course  date
jo      j@c.i   one     A       6/10/20
bo      b@c.i   one     B       6/11/20
bo      b@c.i   one     B       6/10/20
mo      mo@i    one     B       6/10/20
jay     j@i     one     B       6/11/20

Map:

lead    course  date
jo      A       6/10/20
bo      B       6/11/20
mo      B       6/10/20

Output:

name    email   file    course  date      lead
jo      j@c.i   one     A       6/10/20     1
bo      b@c.i   one     B       6/11/20     1
bo      b@c.i   one     B       6/10/20     0
mo      mo@i    one     B       6/10/20     1
jay     j@i     one     B       6/11/20     0
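The same membership test can probably be done without a Python-level loop by building a MultiIndex from the three key columns and calling isin on it. A minimal sketch, assuming the input and map shown above; lead_keys and row_keys are illustrative names only.

```python
import pandas as pd

# Input and map copied from the example above.
df = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo", "jay"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "mo@i", "j@i"],
    "file": ["one"] * 5,
    "course": ["A", "B", "B", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20", "6/10/20", "6/11/20"],
})
df_map = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/10/20"],
})

# One plain tuple per lookup row: (lead, course, date).
lead_keys = list(
    df_map[["lead", "course", "date"]].itertuples(index=False, name=None)
)

# One (name, course, date) key per row of df, in row order.
row_keys = pd.MultiIndex.from_frame(df[["name", "course", "date"]])

# isin returns a boolean array aligned with the rows of df.
df["lead"] = row_keys.isin(lead_keys).astype(int)
print(df)
```

Since nothing is merged, the number of rows in df cannot change, no matter how many leads a course has.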

Use pd.crosstab(), which tabulates how often each person led each course on each date. Then stack the result and rename the columns appropriately. This gives a new dataframe, which you combine with new_data using .combine_first(). The combined frame also contains the extra rows that come from the crosstab, so drop any rows with NaNs.

Please note that df = map_data:

Chained solution

new_data.combine_first(pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})).dropna()

Step by step solution

# Crosstab
df3 = pd.crosstab([df.lead, df.course], df.date).stack().reset_index().rename(columns={'lead':'name',0:'lead'})
# Combine_first
res = new_data.combine_first(df3).dropna()
print(res)



 course     date  email file  lead name
0      A  6/10/20  j@c.i  one   0.0   jo
1      A  6/10/20  b@c.i  one   1.0   bo
2      B  6/11/20  b@c.i  one   1.0   bo
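One thing to keep in mind is that .combine_first() lines rows up by index label, not by the name, course, and date values. A variant that keeps the crosstab but merges on the three key columns might look like the sketch below. It assumes the frames from the question, and n_led is a hypothetical helper column name.

```python
import pandas as pd

# Sample frames copied from the question.
new_data = pd.DataFrame({
    "name": ["jo", "bo", "bo", "mo"],
    "email": ["j@c.i", "b@c.i", "b@c.i", "m@c.i"],
    "file": ["one", "one", "one", "one"],
    "course": ["A", "A", "B", "B"],
    "date": ["6/10/20", "6/10/20", "6/11/20", "6/11/20"],
})
map_data = pd.DataFrame({
    "lead": ["jo", "bo", "mo"],
    "course": ["A", "B", "B"],
    "date": ["6/10/20", "6/11/20", "6/11/20"],
})

# Tabulate how often each (lead, course) pair appears on each date, then
# reshape to one row per (name, course, date) with the count in "n_led".
counts = (
    pd.crosstab([map_data["lead"], map_data["course"]], map_data["date"])
    .stack()
    .reset_index()
    .rename(columns={"lead": "name", 0: "n_led"})
)

# Merge on the key columns; combinations absent from the crosstab get NaN.
res = new_data.merge(counts, on=["name", "course", "date"], how="left")

# A positive count means the person led that course on that date.
res["lead"] = (res["n_led"].fillna(0) > 0).astype(int)
res = res.drop(columns="n_led")
print(res)
```

The crosstab yields one row per key combination, so the left merge keeps exactly one output row per row of new_data.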
