
Add summary columns to a pandas dataframe based on matching values in a different dataframe

I have a DataFrame that lists the members of each project and the project start date, and a second DataFrame that holds birth dates. I'm trying to add columns giving the number of project members in certain age groups, with ages calculated as of each project's start date.

print(projects)
           Start  John  Bob  Gladys
Project                               
A     2014-01-08     1    0       0
B     2016-08-09     0    1       1
C     2018-02-06     0    1       0

print(birthdays)
             birth
name              
John    1983-04-06
Gladys  1969-08-02
Bob     1946-11-03

I've considered using .apply() or .iterrows(), but I'm having a hard time working out where to start. The real DataFrames have significantly more columns and rows, so I need to avoid referring to any columns by name.

This is what I'm trying to accomplish:

              Start  John  Bob  Gladys  25-34  35-45  46-55  56+
Project                                                         
A        2014-01-08     1    0       0      1      0      0    0
B        2016-08-09     0    1       1      0      0      1    1
C        2018-02-06     0    1       0      0      0      0    1

Any suggestions on where to start?

Here is one way using melt, cut, crosstab and merge:

import numpy as np
import pandas as pd

# Assumes projects['Start'] and birthdays['birth'] are datetime64 columns.
# Unpivot the dataframe, keeping Start as the id column, and keep only rows where value == 1.
melt_ = projects.melt('Start').query('value == 1')
# Map the variable column (member names) to birth dates using the birthdays
# dataframe squeezed into a Series, then subtract from Start to get the age in years.
Age = (melt_.Start - melt_.variable.map(birthdays.squeeze())).dt.days // 365
# Cut Age into bins and assign the labels. pd.cut bins are right-closed by
# default: (25, 35], (35, 45], (45, 55], (55, inf).
bins = [25, 35, 45, 55, np.inf]
labels = ['25-34', '35-45', '46-55', '56+']
melt_ = melt_.assign(Age=pd.cut(Age, bins, labels=labels))
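
One detail worth checking is how pd.cut treats ages that sit exactly on a bin edge, because the labels read as inclusive ranges ('25-34', '35-45', ...) while pd.cut closes each bin on the right by default. The sketch below compares the default behaviour with right=False and shifted edges. The shifted edges [25, 35, 46, 56] are an assumption about what the labels are meant to say, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Ages chosen to sit on or next to every bin edge.
ages = pd.Series([25, 34, 35, 45, 46, 55, 56])
labels = ['25-34', '35-45', '46-55', '56+']

# Default right=True: bins are (25, 35], (35, 45], (45, 55], (55, inf).
# An age of exactly 25 is outside every bin and becomes NaN; 35 lands in '25-34'.
right_closed = pd.cut(ages, [25, 35, 45, 55, np.inf], labels=labels)

# right=False with shifted edges: [25, 35), [35, 46), [46, 56), [56, inf).
# For whole-year ages this matches the labels read as inclusive ranges.
left_closed = pd.cut(ages, [25, 35, 46, 56, np.inf], labels=labels, right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

For the sample data in the question the two variants give the same counts, since none of the ages falls on an edge.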

Finally, merge the original df with the output of pd.crosstab(melt_.Start, melt_.Age):

projects.merge(pd.crosstab(melt_.Start, melt_.Age).reindex(columns=labels, fill_value=0),
               left_on='Start', right_index=True)

              Start  John  Bob  Gladys  25-34  35-45  46-55  56+
Project                                                        
A       2014-01-08     1    0       0      1      0      0    0
B       2016-08-09     0    1       1      0      0      1    1
C       2018-02-06     0    1       0      0      0      0    1
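
For anyone who wants to run the approach end to end, here is a self-contained sketch that rebuilds the sample frames from the question. It differs from the snippets above in one respect, which is my own assumption about what is safer: it carries the Project index through the melt and joins on it, so two projects that happen to share a start date are not counted together. The names long, counts and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with real datetime columns.
projects = pd.DataFrame(
    {'Start': pd.to_datetime(['2014-01-08', '2016-08-09', '2018-02-06']),
     'John': [1, 0, 0], 'Bob': [0, 1, 1], 'Gladys': [0, 1, 0]},
    index=pd.Index(['A', 'B', 'C'], name='Project'))
birthdays = pd.DataFrame(
    {'birth': pd.to_datetime(['1983-04-06', '1969-08-02', '1946-11-03'])},
    index=pd.Index(['John', 'Gladys', 'Bob'], name='name'))

labels = ['25-34', '35-45', '46-55', '56+']

# Long format: one row per (project, member) pair where the member takes part.
long = (projects.reset_index()
        .melt(id_vars=['Project', 'Start'], var_name='member')
        .query('value == 1'))

# Approximate age in whole years at the project start, then bin it.
age = (long['Start'] - long['member'].map(birthdays['birth'])).dt.days // 365
long = long.assign(Age=pd.cut(age, [25, 35, 45, 55, np.inf], labels=labels))

# Count members per project and age group; reindex so empty groups and
# projects without any binned member still show up as zeros.
counts = (pd.crosstab(long['Project'], long['Age'])
          .reindex(index=projects.index, columns=labels, fill_value=0))
counts.columns = list(counts.columns)  # plain labels instead of a categorical index

result = projects.join(counts)
print(result)
```

No member column is named explicitly: melt picks up every column other than Project and Start.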

Note: The expected output is wrong, since Bob is older and falls in the 56+ group as of the project dates.
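
To check ages like Bob's by hand, a calendar-based calculation can be used instead of days // 365, which ignores leap days and can run one year high shortly before a birthday. The helper age_in_years below is a hypothetical name for illustration, not part of pandas.

```python
import pandas as pd

def age_in_years(birth: pd.Timestamp, on: pd.Timestamp) -> int:
    """Completed calendar years between `birth` and `on`."""
    # Subtract one year if the birthday has not yet occurred in the year of `on`.
    before_birthday = (on.month, on.day) < (birth.month, birth.day)
    return on.year - birth.year - int(before_birthday)

bob = pd.Timestamp('1946-11-03')
print(age_in_years(bob, pd.Timestamp('2016-08-09')))  # Bob at the start of project B
print(age_in_years(bob, pd.Timestamp('2018-02-06')))  # Bob at the start of project C

# A made-up date pair nine days before a 56th birthday: the day-count
# approximation already reports 56, the calendar-based helper reports 55.
birth, on = pd.Timestamp('1960-01-10'), pd.Timestamp('2016-01-01')
approx = (on - birth).days // 365
exact = age_in_years(birth, on)
print(approx, exact)
```

For the sample data both methods agree, so the choice only matters for people close to a group boundary.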

This should give you the desired result:

import pandas as pd
from datetime import datetime

#create the Projects dataframe
d1={'Projects': ['A', 'B','C'], 'Start': ['2014-01-08', '2016-08-09','2018-02-06'],'John':[1,0,0],'Bob':[0,1,1],'Gladys':[0,1,0]}
projects=pd.DataFrame(data=d1)
projects.set_index('Projects',inplace=True)

#create the Birthdays dataframe
d2={'name':['John','Bob','Gladys'],'birth':['1983-04-06','1946-11-03','1969-08-02']}
birthdays=pd.DataFrame(data=d2)
birthdays.set_index('name', inplace=True)

# Create the age slab dataframe with dummy values
d3={'Projects': ['A','B','C'], '25-34':[0,0,0], '35-45':[0,0,0], '46-55':[0,0,0], '56+':[0,0,0]}
age_slabs=pd.DataFrame(data=d3)
age_slabs.set_index('Projects',inplace=True)

#concatenate projects and age_slabs dataframe
concat=pd.concat([projects, age_slabs], axis=1)
concat=concat.copy()

date_format = "%Y-%m-%d"

for project in concat.index:
    start = datetime.strptime(concat.loc[project, 'Start'], date_format)

    # age in whole years of every member assigned to this project
    ages = [(start - datetime.strptime(birthdays.loc[name, 'birth'], date_format)).days // 365
            for name in birthdays.index if concat.loc[project, name] == 1]

    # count how many of those members fall in each age group
    concat.loc[project, '25-34'] = sum(25 <= age <= 34 for age in ages)
    concat.loc[project, '35-45'] = sum(35 <= age <= 45 for age in ages)
    concat.loc[project, '46-55'] = sum(46 <= age <= 55 for age in ages)
    concat.loc[project, '56+'] = sum(age >= 56 for age in ages)

print (concat)

Output:

               Start  John  Bob  Gladys  25-34  35-45  46-55  56+
Projects                                                         
A         2014-01-08     1    0       0      1      0      0    0
B         2016-08-09     0    1       1      0      0      1    1
C         2018-02-06     0    1       0      0      0      0    1
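
Since the question asks for something that does not depend on column names and scales to larger frames, the same counts can also be computed without a Python-level loop over rows. The following is only a sketch under assumptions: member columns are identified as the names that appear in the birthdays index, ages are approximated as days // 365, and the groups list with inclusive bounds is my reading of the labels.

```python
import numpy as np
import pandas as pd

# Same sample data as above, with dates given as strings.
projects = pd.DataFrame(
    {'Start': ['2014-01-08', '2016-08-09', '2018-02-06'],
     'John': [1, 0, 0], 'Bob': [0, 1, 1], 'Gladys': [0, 1, 0]},
    index=pd.Index(['A', 'B', 'C'], name='Projects'))
birthdays = pd.DataFrame(
    {'birth': ['1983-04-06', '1946-11-03', '1969-08-02']},
    index=pd.Index(['John', 'Bob', 'Gladys'], name='name'))

# Member columns are discovered from the birthdays index, not typed out by hand.
members = [name for name in birthdays.index if name in projects.columns]

start = pd.to_datetime(projects['Start']).to_numpy()[:, None]       # shape (n_projects, 1)
birth = pd.to_datetime(birthdays.loc[members, 'birth']).to_numpy()  # shape (n_members,)
# Broadcasting gives one approximate age per (project, member) pair.
ages = (start - birth) / np.timedelta64(1, 'D') // 365

# (label, lowest age, highest age) for each group; both bounds are inclusive.
groups = [('25-34', 25, 34), ('35-45', 35, 45), ('46-55', 46, 55), ('56+', 56, np.inf)]

assigned = projects[members].to_numpy() == 1
result = projects.copy()
for label, low, high in groups:
    in_group = (ages >= low) & (ages <= high)
    # Count the members who are both assigned to the project and in this group.
    result[label] = (assigned & in_group).sum(axis=1)
print(result)
```

The only loop left runs over the four age groups, so the cost per group is a single array operation regardless of how many projects or members there are.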


