I've got a DataFrame that indicates members of a project and the project start date, and a second DataFrame that indicates birth dates. I'm trying to add a number of columns indicating the total number of people in certain age groups based on the start of each project.
print(projects)
              Start  John  Bob  Gladys
Project
A        2014-01-08     1    0       0
B        2016-08-09     0    1       1
C        2018-02-06     0    1       0
print(birthdays)
            birth
name
John   1983-04-06
Gladys 1969-08-02
Bob    1946-11-03
I've considered using the .apply() or .iterrows() methods, but I'm having a hard time even wrapping my head around where to start. The real DataFrames have significantly more columns and rows, so I need to avoid calling any columns by name.
This is what I'm trying to accomplish:
              Start  John  Bob  Gladys  25-34  35-45  46-55  56+
Project
A        2014-01-08     1    0       0      1      0      0    0
B        2016-08-09     0    1       1      0      0      1    1
C        2018-02-06     0    1       0      0      0      0    1
Any suggestions on where to start?
Here is one way using melt, cut, crosstab and merge:
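import numpy as np
import pandas as pd

# Assumes projects['Start'] and birthdays['birth'] are datetime64 columns (convert with pd.to_datetime otherwise).
# Unpivot the DataFrame, keeping Start as the identifier column, and keep only rows where value == 1.
melt_ = projects.melt('Start').query('value == 1')
# Map the variable column (member names) to birth dates, using the birthdays DataFrame squeezed into a Series.
# Once mapped, subtract from the Start column to get the approximate age in whole years.
Age = (melt_.Start - melt_.variable.map(birthdays.squeeze())).dt.days // 365
# Using pd.cut, bin Age and assign the labels.
# right=False makes the bins [25, 35), [35, 46), [46, 56), [56, inf) so that they match the label names.
bins = [25, 35, 46, 56, np.inf]
labels = ['25-34', '35-45', '46-55', '56+']
melt_ = melt_.assign(Age=pd.cut(Age, bins, labels=labels, right=False))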
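Which side of each bin is closed matters for ages that sit exactly on an edge. A small sketch comparing `pd.cut`'s default right-closed bins on edges 25, 35, 45, 55 with left-closed bins (`right=False`) on edges 25, 35, 46, 56; the `ages` values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the group boundaries
ages = pd.Series([25, 34, 35, 45, 46, 55, 56])
labels = ['25-34', '35-45', '46-55', '56+']

# Default right=True: intervals are (25, 35], (35, 45], (45, 55], (55, inf]
right_closed = pd.cut(ages, [25, 35, 45, 55, np.inf], labels=labels)

# right=False: intervals are [25, 35), [35, 46), [46, 56), [56, inf)
left_closed = pd.cut(ages, [25, 35, 46, 56, np.inf], labels=labels, right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

With the right-closed edges, an age of 25 falls outside every bin and an age of 35 is labelled 25-34; the left-closed variant puts both where the label names suggest.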
Finally, merge the original DataFrame with the output of pd.crosstab(melt_.Start, melt_.Age):
projects.merge(pd.crosstab(melt_.Start,melt_.Age).reindex(columns=labels,fill_value=0)
,left_on='Start',right_index=True)
              Start  John  Bob  Gladys  25-34  35-45  46-55  56+
Project
A        2014-01-08     1    0       0      1      0      0    0
B        2016-08-09     0    1       1      0      0      1    1
C        2018-02-06     0    1       0      0      0      0    1
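Merging on `Start` works for the sample data, but it would combine counts if two projects shared a start date. A sketch of the same melt/cut/crosstab pipeline keyed on the `Project` index instead; the sample frames are rebuilt here so the block runs on its own, and names such as `long_df`, `group` and `result` are illustrative, not taken from the answer:

```python
import numpy as np
import pandas as pd

# Sample data from the question, with dates parsed as datetime64
projects = pd.DataFrame(
    {"Start": pd.to_datetime(["2014-01-08", "2016-08-09", "2018-02-06"]),
     "John": [1, 0, 0], "Bob": [0, 1, 1], "Gladys": [0, 1, 0]},
    index=pd.Index(["A", "B", "C"], name="Project"),
)
birthdays = pd.DataFrame(
    {"birth": pd.to_datetime(["1983-04-06", "1969-08-02", "1946-11-03"])},
    index=pd.Index(["John", "Gladys", "Bob"], name="name"),
)

labels = ["25-34", "35-45", "46-55", "56+"]
bins = [25, 35, 46, 56, np.inf]  # assumption: left-closed bins, see right=False below

# Long format: one row per (project, member) pair that is actually assigned
long_df = (projects.reset_index()
           .melt(id_vars=["Project", "Start"], var_name="name")
           .query("value == 1"))

# Approximate age in whole years at the project start
birth = long_df["name"].map(birthdays["birth"])
age = (long_df["Start"] - birth).dt.days // 365
group = pd.cut(age, bins=bins, labels=labels, right=False).astype(object)
long_df = long_df.assign(group=group)

# Count members per project and age group, then attach to the original frame
counts = (pd.crosstab(long_df["Project"], long_df["group"])
          .reindex(index=projects.index, columns=labels, fill_value=0))
result = projects.join(counts)
print(result)
```

No member column is referenced by name, so the same code should carry over to wider frames as long as every non-`Start` column is a member flag.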
Note: Bob is 69 at the start of project B and 71 at the start of project C, so he falls in the 56+ group for both.
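The ages behind that note can be checked by hand. A minimal sketch, using a hypothetical helper `age_on` that counts completed calendar years (rather than `days // 365`, which can be off by one close to a birthday because of leap days):

```python
from datetime import date

def age_on(birth: date, when: date) -> int:
    """Completed years between birth and when (hypothetical helper)."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)

bob = date(1946, 11, 3)
print(age_on(bob, date(2016, 8, 9)))  # 69, start of project B
print(age_on(bob, date(2018, 2, 6)))  # 71, start of project C
```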
This should give you the desired result:
import pandas as pd
from datetime import datetime

# Create the projects DataFrame
d1 = {'Projects': ['A', 'B', 'C'], 'Start': ['2014-01-08', '2016-08-09', '2018-02-06'],
      'John': [1, 0, 0], 'Bob': [0, 1, 1], 'Gladys': [0, 1, 0]}
projects = pd.DataFrame(data=d1).set_index('Projects')

# Create the birthdays DataFrame
d2 = {'name': ['John', 'Bob', 'Gladys'], 'birth': ['1983-04-06', '1946-11-03', '1969-08-02']}
birthdays = pd.DataFrame(data=d2).set_index('name')

# Create the age slab DataFrame, initialised to zero
slabs = ['25-34', '35-45', '46-55', '56+']
age_slabs = pd.DataFrame(0, index=projects.index, columns=slabs)

# Concatenate the projects and age_slabs DataFrames
concat = pd.concat([projects, age_slabs], axis=1)

date_format = "%Y-%m-%d"
for project in concat.index:
    start = datetime.strptime(concat.at[project, 'Start'], date_format)
    # Loop over every member listed in birthdays, so no column is named explicitly
    for name in birthdays.index:
        if concat.at[project, name] != 1:
            continue
        birth = datetime.strptime(birthdays.at[name, 'birth'], date_format)
        age = (start - birth).days // 365  # approximate age in whole years
        # Increment the matching slab, so several members in one slab add up
        if 25 <= age <= 34:
            concat.at[project, '25-34'] += 1
        elif 35 <= age <= 45:
            concat.at[project, '35-45'] += 1
        elif 46 <= age <= 55:
            concat.at[project, '46-55'] += 1
        elif age >= 56:
            concat.at[project, '56+'] += 1
print(concat)
Output:
               Start  John  Bob  Gladys  25-34  35-45  46-55  56+
Projects
A         2014-01-08     1    0       0      1      0      0    0
B         2016-08-09     0    1       1      0      0      1    1
C         2018-02-06     0    1       0      0      0      0    1
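When per-member indicators are added up in a single expression, operator precedence matters: Python's conditional expression binds more loosely than `+`, so each indicator needs its own parentheses, or the count can be taken as a `sum` over booleans. A minimal sketch with made-up ages:

```python
ages = [27, 30, 61]  # hypothetical ages of three members on one project

# Without parentheses this parses as: 1 if a else ((0 + 1) if b else ...),
# so the result can never exceed 1
unparenthesised = (1 if 25 <= ages[0] <= 34 else 0
                   + 1 if 25 <= ages[1] <= 34 else 0
                   + 1 if 25 <= ages[2] <= 34 else 0)

# Summing booleans counts every member in the 25-34 group
counted = sum(25 <= a <= 34 for a in ages)

print(unparenthesised, counted)  # 1 2
```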