Convert a dataframe in pandas based on column names
I have a pandas DataFrame that looks like this:
employeeId cumbId firstName lastName emailAddress \
0 E123456 102939485 Andrew Hoover hoovera@xyz.com
1 E123457 675849302 Curt Austin austinc1@xyz.com
2 E123458 354852739 Celeste Riddick riddickc@xyz.com
3 E123459 937463528 Hazel Tooley tooleyh@xyz.com
employeeIdTypeCode cumbIDTypeCode entityCode sourceCode roleCode
0 001 002 AE AWB EMPLR
1 001 002 AE AWB EMPLR
2 001 002 AE AWB EMPLR
3 001 002 AE AWB EMPLR
I want it to look like this, with one row for each ID and ID type code in the DataFrame:
idvalue IDTypeCode firstName lastName emailAddress entityCode sourceCode roleCode CodeName
E123456 001 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
102939485 002 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
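Can this be done with some pandas function? I would also like it to be dynamic, based on the number of IDs in the DataFrame.
By dynamic I mean that if there are 3 IDs, it should look like this: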
idvalue IDTypeCode firstName lastName emailAddress entityCode sourceCode roleCode CodeName
A123456 001 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
102939485 002 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
M1000 003 Andrew Hoover hoovera@xyz.com AE AWB EMPLR 1
Thanks!
I think this is what you are looking for... After splitting the DataFrame into parts, you can use concat:
# create a new df without the id columns
df2 = df.loc[:, ~df.columns.isin(['employeeId','employeeIdTypeCode'])]
# rename the cumb columns so they line up with the employee columns in df
df2 = df2.rename(columns={'cumbId':'employeeId', 'cumbIDTypeCode':'employeeIdTypeCode'})
# concatenate the two dataframes
pd.concat([df,df2], sort=False).drop(columns=['cumbId','cumbIDTypeCode']).sort_values('firstName')
# rename columns here if you want
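The same concat idea can be extended to any number of IDs by looping over (ID column, type-code column) pairs. The following is only a minimal sketch under the assumption that the pairs are listed by hand; the helper name `stack_id_pairs` and its `pairs` argument are made up for illustration and are not part of pandas.

```python
import pandas as pd


def stack_id_pairs(df, pairs, value_name="idvalue", code_name="IDTypeCode"):
    """Stack every (ID column, type-code column) pair into two shared columns.

    Hypothetical helper: `pairs` is a list of (id_col, code_col) tuples.
    """
    # Columns that belong to no pair are carried over unchanged.
    paired = {col for pair in pairs for col in pair}
    shared = [col for col in df.columns if col not in paired]

    pieces = []
    for id_col, code_col in pairs:
        # One slice per pair, renamed so that all slices share the same columns.
        piece = df[[id_col, code_col] + shared].rename(
            columns={id_col: value_name, code_col: code_name}
        )
        pieces.append(piece)

    # Rows come out grouped by pair: all first IDs, then all second IDs, ...
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({
    "employeeId": ["E123456", "E123457"],
    "cumbId": ["102939485", "675849302"],
    "firstName": ["Andrew", "Curt"],
    "employeeIdTypeCode": ["001", "001"],
    "cumbIDTypeCode": ["002", "002"],
})

result = stack_id_pairs(
    df, [("employeeId", "employeeIdTypeCode"), ("cumbId", "cumbIDTypeCode")]
)
print(result)
```

Adding a third ID then only means adding a third tuple to the list. If the rows of one person should sit next to each other, the result can be sorted afterwards, as the snippet above does with sort_values.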
# sample df
employeeId cumbId otherId1 firstName lastName emailAddress \
0 E123456 102939485 5 Andrew Hoover hoovera@xyz.com
1 E123457 675849302 5 Curt Austin austinc1@xyz.com
2 E123458 354852739 5 Celeste Riddick riddickc@xyz.com
3 E123459 937463528 5 Hazel Tooley tooleyh@xyz.com
employeeIdTypeCode cumbIDTypeCode otherIdTypeCode1 entityCode sourceCode \
0 1 2 6 AE AWB
1 1 2 6 AE AWB
2 1 2 6 AE AWB
3 1 2 6 AE AWB
roleCode
0 EMPLR
1 EMPLR
2 EMPLR
3 EMPLR
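A few rules have to hold:
1. There are always two "matching columns".
2. All matching IDs are next to each other.
3. The number of ID groups (rows to add) is known.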
import numpy as np
import pandas as pd

# note: groupby(..., axis=1) and applymap are deprecated in recent pandas releases (2.1 and later)
def myFunc(df, num_id): # num_id is the number of id groups
    # find all columns that contain the string id
    id_col = df.loc[:, df.columns.str.lower().str.contains('id')].columns
    # rename columns to id_0 and id_1
    df = df.rename(columns=dict(zip(df.loc[:, df.columns.str.lower().str.contains('id')].columns,
                                    ['id_'+str(i) for i in range(int(len(id_col)/num_id)) for x in range(num_id)])))
    # group identically named columns and collect their values into lists
    new = df.groupby(df.columns.values, axis=1).agg(lambda x: x.values.tolist())
    data = []
    # for-loop to explode the lists
    for n in range(len(new.loc[:, new.columns.str.lower().str.contains('id')].columns)):
        s = new.loc[:, new.columns.str.lower().str.contains('id')]
        i = np.arange(len(new)).repeat(s.iloc[:,n].str.len())
        data.append(new.iloc[i, :-1].assign(**{'id_'+str(n): np.concatenate(s.iloc[:,n].values)}))
    # remove the list from all cells
    data0 = data[0].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()
    data1 = data[1].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()
    # update dataframes
    data0.update(data1[['id_1']])
    return data0

myFunc(df,3)
emailAddress entityCode firstName id_0 id_1 lastName roleCode
0 hoovera@xyz.com AE Andrew E123456 1 Hoover EMPLR
0 hoovera@xyz.com AE Andrew 102939485 2 Hoover EMPLR
0 hoovera@xyz.com AE Andrew 5 6 Hoover EMPLR
1 austinc1@xyz.com AE Curt E123457 1 Austin EMPLR
1 austinc1@xyz.com AE Curt 675849302 2 Austin EMPLR
1 austinc1@xyz.com AE Curt 5 6 Austin EMPLR
2 riddickc@xyz.com AE Celeste E123458 1 Riddick EMPLR
2 riddickc@xyz.com AE Celeste 354852739 2 Riddick EMPLR
2 riddickc@xyz.com AE Celeste 5 6 Riddick EMPLR
3 tooleyh@xyz.com AE Hazel E123459 1 Tooley EMPLR
3 tooleyh@xyz.com AE Hazel 937463528 2 Tooley EMPLR
3 tooleyh@xyz.com AE Hazel 5 6 Tooley EMPLR
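The function above finds the ID columns by looking for the substring 'id' in the column names. A smaller sketch of just that step may make the assumption easier to see. It assumes that every type-code column contains "TypeCode" in its name and that ID columns and type-code columns appear in the same order; the helper name `infer_id_pairs` is hypothetical.

```python
def infer_id_pairs(columns):
    """Pair ID columns with type-code columns by position (hypothetical helper)."""
    lowered = [(col, col.lower()) for col in columns]
    # Type-code columns: anything whose name contains "typecode".
    code_cols = [col for col, low in lowered if "typecode" in low]
    # ID value columns: names containing "id" that are not type-code columns.
    id_cols = [col for col, low in lowered if "id" in low and "typecode" not in low]
    if len(id_cols) != len(code_cols):
        raise ValueError("ID columns and type-code columns do not line up")
    return list(zip(id_cols, code_cols))


columns = [
    "employeeId", "cumbId", "otherId1", "firstName", "lastName", "emailAddress",
    "employeeIdTypeCode", "cumbIDTypeCode", "otherIdTypeCode1",
    "entityCode", "sourceCode", "roleCode",
]
pairs = infer_id_pairs(columns)
print(pairs)
```

A substring match like this is fragile: a column such as middleName also contains "id" and would be picked up as an ID column, so an explicit list of pairs is safer when the column names are not under your control.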
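As I understand it, you want to generate 2 rows for each source row:
- employeeId (renamed to idvalue), then IDTypeCode = '001', then the 'remaining' columns (but not all of them), and finally CodeName = '1'.
- cumbId, then IDTypeCode = '002', the same 'remaining' columns and CodeName (also = '1').
So the program given below generates 2 DataFrames (df1 and df2) and then builds the result by "interleaving" their rows.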
import pandas as pd
data = [
    [ 'E123456', '102939485', 'Andrew', 'Hoover', 'hoovera@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123457', '675849302', 'Curt', 'Austin', 'austinc1@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123458', '354852739', 'Celeste', 'Riddick', 'riddickc@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123459', '937463528', 'Hazel', 'Tooley', 'tooleyh@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ]
]
df = pd.DataFrame(data=data, columns=['employeeId', 'cumbId', 'firstName', 'lastName',
    'emailAddress', 'employeeIdTypeCode', 'cumbIDTypeCode', 'entityCode', 'sourceCode',
    'roleCode' ])
# 'Remaining' columns
cols = ['firstName', 'lastName', 'emailAddress', 'entityCode', 'sourceCode', 'roleCode']
# df1: employeeId, IDTypeCode = '001' and 'remaining' columns
# (set_axis returns a new object; its inplace argument was removed in pandas 2.0)
df1 = df[['employeeId']].set_axis(['idvalue'], axis=1)
df1['IDTypeCode'] = '001'
df1 = df1.join(df[cols])
df1['CodeName'] = '1'
# df2: cumbId, IDTypeCode = '002' and 'remaining' columns
df2 = df[['cumbId']].set_axis(['idvalue'], axis=1)
df2['IDTypeCode'] = '002'
df2 = df2.join(df[cols])
df2['CodeName'] = '1'
# Result: sorting on the shared index interleaves the rows of df1 and df2
result = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
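One detail of the interleaving step: after concat every index label occurs twice, and sort_index uses quicksort by default, which is not guaranteed to keep rows with equal labels in their original order. The sketch below, on a cut-down pair of frames invented for illustration, asks for a stable sort so that the employeeId row always comes before the cumbId row.

```python
import pandas as pd

# Two small frames standing in for df1 and df2; both are indexed 0..n-1.
df1 = pd.DataFrame({"idvalue": ["E123456", "E123457"], "IDTypeCode": ["001", "001"]})
df2 = pd.DataFrame({"idvalue": ["102939485", "675849302"], "IDTypeCode": ["002", "002"]})

# After concat each index label appears twice. A stable sort (mergesort)
# keeps the df1 row ahead of the df2 row for every label.
result = (
    pd.concat([df1, df2])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```

On a frame as small as the sample the default sort will most likely give the same order, so this mainly matters for larger inputs.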