简体   繁体   中英

Convert a dataframe in pandas based on column names

I have a pandas dataframe that looks something like this:

employeeId     cumbId firstName lastName        emailAddress  \
0    E123456  102939485    Andrew   Hoover   hoovera@xyz.com   
1    E123457  675849302      Curt   Austin  austinc1@xyz.com   
2    E123458  354852739   Celeste  Riddick  riddickc@xyz.com   
3    E123459  937463528     Hazel   Tooley   tooleyh@xyz.com     

  employeeIdTypeCode cumbIDTypeCode entityCode sourceCode roleCode  
0                001            002      AE      AWB    EMPLR  
1                001            002      AE      AWB    EMPLR  
2                001            002      AE      AWB    EMPLR  
3                001            002      AE      AWB    EMPLR  

I want it to look something like this for each ID and IDtypecode in the pandas dataframe:

idvalue   IDTypeCode  firstName lastName  emailAddress  entityCode  sourceCode  roleCode  CodeName
E123456   001         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1
102939485 002         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1

Can this be achieved with some function in pandas dataframe? I also want it to be dynamic based on the number of IDs that are in the dataframe.

What I mean by dynamic is this, if there are 3 Ids then this is how it should look like:

idvalue   IDTypeCode  firstName lastName  emailAddress  entityCode  sourceCode  roleCode  CodeName
A123456   001         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1
102939485 002         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1
M1000     003         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1

Thank you!

I think this is what you are looking for... you can use concat after splitting out the parts of your dataframe:

# create a new df without the id columns
df2 = df.loc[:, ~df.columns.isin(['employeeId','employeeIdTypeCode'])]

# rename columns to match the df columns names that they "match" to
df2 = df2.rename(columns={'cumbId':'employeeId', 'cumbIDTypeCode':'employeeIdTypeCode'})

# concat you dataframes
pd.concat([df,df2], sort=False).drop(columns=['cumbId','cumbIDTypeCode']).sort_values('firstName')

# rename columns here if you want

update

# sample df
  employeeId     cumbId  otherId1 firstName lastName      emailAddress  \
0    E123456  102939485         5    Andrew   Hoover   hoovera@xyz.com   
1    E123457  675849302         5      Curt   Austin  austinc1@xyz.com   
2    E123458  354852739         5   Celeste  Riddick  riddickc@xyz.com   
3    E123459  937463528         5     Hazel   Tooley   tooleyh@xyz.com   

   employeeIdTypeCode  cumbIDTypeCode  otherIdTypeCode1 entityCode sourceCode  \
0                   1               2                 6         AE        AWB   
1                   1               2                 6         AE        AWB   
2                   1               2                 6         AE        AWB   
3                   1               2                 6         AE        AWB   

  roleCode  
0    EMPLR  
1    EMPLR  
2    EMPLR  
3    EMPLR  

There has to be some rules in place:

rule 1. there are always two "match columns" rule 2. all the matched ids are next to each other rule 3. you know the number of Ids groups (rows to add)

def myFunc(df, num_id): # num_id is the number of id groups 
    # find all columns that contain the string id
    id_col = df.loc[:, df.columns.str.lower().str.contains('id')].columns

    # rename columns to id_0 and id_1
    df = df.rename(columns=dict(zip(df.loc[:, df.columns.str.lower().str.contains('id')].columns,
                                ['id_'+str(i) for i in range(int(len(id_col)/num_id)) for x in range(num_id)])))

    # groupby columns and values.tolist
    new = df.groupby(df.columns.values, axis=1).agg(lambda x: x.values.tolist())

    data = []

    # for-loop to explode the lists
    for n in range(len(new.loc[:, new.columns.str.lower().str.contains('id')].columns)):
        s = new.loc[:, new.columns.str.lower().str.contains('id')]
        i = np.arange(len(new)).repeat(s.iloc[:,n].str.len())
        data.append(new.iloc[i, :-1].assign(**{'id_'+str(n): np.concatenate(s.iloc[:,n].values)}))

    # remove the list from all cells
    data0 = data[0].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()
    data1 = data[1].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()

    # update dataframes
    data0.update(data1[['id_1']])

    return data0

myFunc(df,3)


      emailAddress entityCode firstName       id_0  id_1 lastName roleCode
0   hoovera@xyz.com         AE    Andrew    E123456     1   Hoover    EMPLR
0   hoovera@xyz.com         AE    Andrew  102939485     2   Hoover    EMPLR
0   hoovera@xyz.com         AE    Andrew          5     6   Hoover    EMPLR
1  austinc1@xyz.com         AE      Curt    E123457     1   Austin    EMPLR
1  austinc1@xyz.com         AE      Curt  675849302     2   Austin    EMPLR
1  austinc1@xyz.com         AE      Curt          5     6   Austin    EMPLR
2  riddickc@xyz.com         AE   Celeste    E123458     1  Riddick    EMPLR
2  riddickc@xyz.com         AE   Celeste  354852739     2  Riddick    EMPLR
2  riddickc@xyz.com         AE   Celeste          5     6  Riddick    EMPLR
3   tooleyh@xyz.com         AE     Hazel    E123459     1   Tooley    EMPLR
3   tooleyh@xyz.com         AE     Hazel  937463528     2   Tooley    EMPLR
3   tooleyh@xyz.com         AE     Hazel          5     6   Tooley    EMPLR

As I understood, for each source row you want to generate 2 rows:

  • employeeId (renamed to idvalue ), then IDTypeCode = '001', then 'remainig' columns (but not all) and finally CodeName = '1'.
  • cumbId , then IDTypeCode = '002', the same 'remainig' columns and CodeName (also = '1').

So the program given below generates such 2 DataFrames ( df1 and df2 ) and then generates the result "interleaving" their rows.

import pandas as pd

data = [
    [ 'E123456', '102939485', 'Andrew',  'Hoover',  'hoovera@xyz.com',  '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123457', '675849302', 'Curt',    'Austin',  'austinc1@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123458', '354852739', 'Celeste', 'Riddick', 'riddickc@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123459', '937463528', 'Hazel',   'Tooley',  'tooleyh@xyz.com',  '001', '002', 'AE', 'AWB', 'EMPLR' ]
]
df = pd.DataFrame(data=data, columns=['employeeId', 'cumbId', 'firstName', 'lastName',
    'emailAddress', 'employeeIdTypeCode', 'cumbIDTypeCode', 'entityCode', 'sourceCode',
    'roleCode' ])
# 'Remainig' columns
cols = ['firstName', 'lastName', 'emailAddress', 'entityCode', 'sourceCode', 'roleCode']
# df1: employeeId, IDTypeCode = '001' and 'remainig' columns
df1 = df[['employeeId']].set_axis(['idvalue'], axis=1, inplace=False)
df1['IDTypeCode'] = '001'
df1 = df1.join(df[cols])
df1['CodeName'] = '1'
# df2: cumbId, IDTypeCode = '002' and 'remainig' columns
df2 = df[['cumbId']].set_axis(['idvalue'], axis=1, inplace=False)
df2['IDTypeCode'] = '002'
df2 = df2.join(df[cols])
df2['CodeName'] = '1'
# Result
result = pd.concat([df1,df2]).sort_index().reset_index(drop=True)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM