I have a pandas dataframe that looks something like this:
employeeId cumbId firstName lastName emailAddress \
0 E123456 102939485 Andrew Hoover hoovera@xyz.com
1 E123457 675849302 Curt Austin austinc1@xyz.com
2 E123458 354852739 Celeste Riddick riddickc@xyz.com
3 E123459 937463528 Hazel Tooley tooleyh@xyz.com
employeeIdTypeCode cumbIDTypeCode entityCode sourceCode roleCode
0 001 002 AE AWB EMPLR
1 001 002 AE AWB EMPLR
2 001 002 AE AWB EMPLR
3 001 002 AE AWB EMPLR
I want it to look something like this for each ID and IDtypecode in the pandas dataframe:
idvalue IDTypeCode firstName lastName emailAddress entityCode sourceCode roleCode CodeName
E123456 001 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
102939485 002 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
Can this be achieved with some function in pandas? I also want it to be dynamic, based on the number of IDs that are in the dataframe.
What I mean by dynamic is this: if there are 3 IDs, then this is how it should look:
idvalue IDTypeCode firstName lastName emailAddress entityCode sourceCode roleCode CodeName
A123456 001 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
102939485 002 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
M1000 003 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
Thank you!
I think this is what you are looking for... you can use concat after splitting out the parts of your dataframe:
# create a new df without the id columns
df2 = df.loc[:, ~df.columns.isin(['employeeId','employeeIdTypeCode'])]
# rename columns to match the df columns names that they "match" to
df2 = df2.rename(columns={'cumbId':'employeeId', 'cumbIDTypeCode':'employeeIdTypeCode'})
# concat your dataframes
pd.concat([df,df2], sort=False).drop(columns=['cumbId','cumbIDTypeCode']).sort_values('firstName')
# rename columns here if you want
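For reference, here is a minimal self-contained sketch of the concat approach above, run on a two-row sample built from the question's data. The final rename to idvalue and IDTypeCode, the stable sort and the reset_index call are additions for illustration, not part of the snippet above.

```python
import pandas as pd

# Small sample mirroring the question's columns (two rows for brevity).
df = pd.DataFrame({
    "employeeId": ["E123456", "E123457"],
    "cumbId": ["102939485", "675849302"],
    "firstName": ["Andrew", "Curt"],
    "lastName": ["Hoover", "Austin"],
    "emailAddress": ["hoovera@xyz.com", "austinc1@xyz.com"],
    "employeeIdTypeCode": ["001", "001"],
    "cumbIDTypeCode": ["002", "002"],
    "entityCode": ["AE", "AE"],
    "sourceCode": ["AWB", "AWB"],
    "roleCode": ["EMPLR", "EMPLR"],
})

# Copy of df without the employee ID pair, with the cumb pair renamed to match.
df2 = df.drop(columns=["employeeId", "employeeIdTypeCode"]).rename(
    columns={"cumbId": "employeeId", "cumbIDTypeCode": "employeeIdTypeCode"}
)

# Stack both frames, drop the now-redundant cumb columns, and switch to the
# column names requested in the question.
stacked = (
    pd.concat([df, df2], sort=False)
    .drop(columns=["cumbId", "cumbIDTypeCode"])
    .rename(columns={"employeeId": "idvalue", "employeeIdTypeCode": "IDTypeCode"})
    .sort_values("firstName", kind="mergesort")  # stable: employee row stays first
    .reset_index(drop=True)
)
print(stacked)
```

Each person ends up with two adjacent rows, the employee ID first and the cumb ID second.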
# sample df
employeeId cumbId otherId1 firstName lastName emailAddress \
0 E123456 102939485 5 Andrew Hoover hoovera@xyz.com
1 E123457 675849302 5 Curt Austin austinc1@xyz.com
2 E123458 354852739 5 Celeste Riddick riddickc@xyz.com
3 E123459 937463528 5 Hazel Tooley tooleyh@xyz.com
employeeIdTypeCode cumbIDTypeCode otherIdTypeCode1 entityCode sourceCode \
0 1 2 6 AE AWB
1 1 2 6 AE AWB
2 1 2 6 AE AWB
3 1 2 6 AE AWB
roleCode
0 EMPLR
1 EMPLR
2 EMPLR
3 EMPLR
There have to be some rules in place:
rule 1. there are always two "match columns" (an ID column and its type-code column)
rule 2. all the matched IDs are next to each other
rule 3. you know the number of ID groups (rows to add)
import numpy as np
import pandas as pd

# Note: groupby(..., axis=1) and applymap are deprecated in pandas 2.1+,
# so this function targets older pandas versions.
def myFunc(df, num_id):  # num_id is the number of id groups
    # find all columns that contain the string id
    id_col = df.loc[:, df.columns.str.lower().str.contains('id')].columns
    # rename columns to id_0 and id_1
    df = df.rename(columns=dict(zip(df.loc[:, df.columns.str.lower().str.contains('id')].columns,
                                    ['id_'+str(i) for i in range(int(len(id_col)/num_id)) for x in range(num_id)])))
    # groupby columns and values.tolist
    new = df.groupby(df.columns.values, axis=1).agg(lambda x: x.values.tolist())
    data = []
    # for-loop to explode the lists
    for n in range(len(new.loc[:, new.columns.str.lower().str.contains('id')].columns)):
        s = new.loc[:, new.columns.str.lower().str.contains('id')]
        i = np.arange(len(new)).repeat(s.iloc[:,n].str.len())
        data.append(new.iloc[i, :-1].assign(**{'id_'+str(n): np.concatenate(s.iloc[:,n].values)}))
    # remove the list from all cells
    data0 = data[0].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()
    data1 = data[1].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()
    # update dataframes
    data0.update(data1[['id_1']])
    return data0

myFunc(df,3)
emailAddress entityCode firstName id_0 id_1 lastName roleCode
0 hoovera@xyz.com AE Andrew E123456 1 Hoover EMPLR
0 hoovera@xyz.com AE Andrew 102939485 2 Hoover EMPLR
0 hoovera@xyz.com AE Andrew 5 6 Hoover EMPLR
1 austinc1@xyz.com AE Curt E123457 1 Austin EMPLR
1 austinc1@xyz.com AE Curt 675849302 2 Austin EMPLR
1 austinc1@xyz.com AE Curt 5 6 Austin EMPLR
2 riddickc@xyz.com AE Celeste E123458 1 Riddick EMPLR
2 riddickc@xyz.com AE Celeste 354852739 2 Riddick EMPLR
2 riddickc@xyz.com AE Celeste 5 6 Riddick EMPLR
3 tooleyh@xyz.com AE Hazel E123459 1 Tooley EMPLR
3 tooleyh@xyz.com AE Hazel 937463528 2 Tooley EMPLR
3 tooleyh@xyz.com AE Hazel 5 6 Tooley EMPLR
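If the ID columns and their type-code columns can be listed explicitly, a simpler variant is possible. The sketch below is a different technique from myFunc: it builds one frame per pair and concatenates them. The function name stack_id_pairs, its pairs argument and the sample values (such as M1001) are hypothetical and only serve the illustration.

```python
import pandas as pd

def stack_id_pairs(df, pairs):
    """Return one row per (source row, ID pair).

    `pairs` is a list of (id_column, type_code_column) tuples. Both the
    function name and this argument are hypothetical helpers.
    """
    paired_cols = [col for pair in pairs for col in pair]
    shared = df.drop(columns=paired_cols)  # columns repeated on every output row
    pieces = []
    for id_col, type_col in pairs:
        piece = shared.copy()
        # Put the ID and its type code first, as in the desired output.
        piece.insert(0, "idvalue", df[id_col])
        piece.insert(1, "IDTypeCode", df[type_col])
        pieces.append(piece)
    # A stable sort on the original index keeps each person's rows together
    # and in the order the pairs were listed.
    return pd.concat(pieces).sort_index(kind="mergesort").reset_index(drop=True)

# Illustrative sample with three ID groups (values are made up).
df = pd.DataFrame({
    "employeeId": ["E123456", "E123457"],
    "cumbId": ["102939485", "675849302"],
    "otherId1": ["M1000", "M1001"],
    "firstName": ["Andrew", "Curt"],
    "employeeIdTypeCode": ["001", "001"],
    "cumbIDTypeCode": ["002", "002"],
    "otherIdTypeCode1": ["003", "003"],
    "roleCode": ["EMPLR", "EMPLR"],
})

result = stack_id_pairs(df, [
    ("employeeId", "employeeIdTypeCode"),
    ("cumbId", "cumbIDTypeCode"),
    ("otherId1", "otherIdTypeCode1"),
])
print(result)
```

Adding a fourth ID then only means adding one more tuple to the list, at the cost of naming the pairs by hand instead of detecting them from the column names.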
As I understand it, for each source row you want to generate 2 rows:
1. employeeId (renamed to idvalue), then IDTypeCode = '001', then the 'remaining' columns (but not all of them) and finally CodeName = '1'.
2. cumbId, then IDTypeCode = '002', the same 'remaining' columns and CodeName (also = '1').
So the program below builds two such DataFrames (df1 and df2) and then produces the result by "interleaving" their rows.
import pandas as pd
data = [
    [ 'E123456', '102939485', 'Andrew', 'Hoover', 'hoovera@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123457', '675849302', 'Curt', 'Austin', 'austinc1@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123458', '354852739', 'Celeste', 'Riddick', 'riddickc@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123459', '937463528', 'Hazel', 'Tooley', 'tooleyh@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ]
]
df = pd.DataFrame(data=data, columns=['employeeId', 'cumbId', 'firstName', 'lastName',
    'emailAddress', 'employeeIdTypeCode', 'cumbIDTypeCode', 'entityCode', 'sourceCode',
    'roleCode' ])
# 'Remaining' columns
cols = ['firstName', 'lastName', 'emailAddress', 'entityCode', 'sourceCode', 'roleCode']
# df1: employeeId, IDTypeCode = '001' and 'remaining' columns
# (rename is used here because set_axis no longer accepts inplace in pandas 2.x)
df1 = df[['employeeId']].rename(columns={'employeeId': 'idvalue'})
df1['IDTypeCode'] = '001'
df1 = df1.join(df[cols])
df1['CodeName'] = '1'
# df2: cumbId, IDTypeCode = '002' and 'remaining' columns
df2 = df[['cumbId']].rename(columns={'cumbId': 'idvalue'})
df2['IDTypeCode'] = '002'
df2 = df2.join(df[cols])
df2['CodeName'] = '1'
# Result
result = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
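One detail about the interleaving step: the concatenated frame contains every index label twice, and the default sort algorithm of sort_index (quicksort) is not documented as stable, so the relative order of the two rows in each pair is not guaranteed. A small sketch, assuming the employeeId row should always come before the cumbId row, passes kind="mergesort" to make the sort stable. The two tiny frames below are stand-ins for df1 and df2.

```python
import pandas as pd

# Stand-ins for df1 and df2 from the program above (only two columns shown).
df1 = pd.DataFrame({"idvalue": ["E123456", "E123457"], "IDTypeCode": ["001", "001"]})
df2 = pd.DataFrame({"idvalue": ["102939485", "675849302"], "IDTypeCode": ["002", "002"]})

# Both frames share the index 0..n-1, so sorting the concatenated index puts
# rows from the same source row next to each other. A stable sort (mergesort)
# keeps the df1 row ahead of the df2 row within each pair.
result = (
    pd.concat([df1, df2])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```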