How to transform a Python data frame such that unique row values are transposed to columns and values of another column become their rows
How to transform rows of other columns to columns on the basis of unique values of a column?
Suppose I have a df with the following structure:
column1 | column2 | column3 | column4 | column5 | column6 | column7
A | B | C | 10 | 78 | 12 | 202001
A | B | D | 21 | 64 | 87 | 202001
A | B | E | 21 | 64 | 87 | 202001
X | K | C | 54 | 23 | 23 | 202001
X | K | D | 21 | 55 | 87 | 202001
X | K | E | 21 | 43 | 22 | 202001
A | B | C | 10 | 78 | 12 | 202002
A | B | D | 23 | 64 | 87 | 202002
A | B | E | 21 | 11 | 34 | 202002
Z | K | C | 10 | 78 | 12 | 202002
Z | K | D | 21 | 13 | 56 | 202002
Z | K | E | 12 | 77 | 34 | 202002
Relationship from column1 to column2: one-to-many
Relationship from column2 to column1: one-to-many
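For anyone who wants to reproduce the example, the frame above could be built roughly as follows. This is only a convenience sketch; the variable name `rows` is an assumption, and the values are copied from the table in the question.

```python
import pandas as pd

# Rows copied from the table in the question:
# (column1, column2, column3, column4, column5, column6, column7)
rows = [
    ("A", "B", "C", 10, 78, 12, 202001),
    ("A", "B", "D", 21, 64, 87, 202001),
    ("A", "B", "E", 21, 64, 87, 202001),
    ("X", "K", "C", 54, 23, 23, 202001),
    ("X", "K", "D", 21, 55, 87, 202001),
    ("X", "K", "E", 21, 43, 22, 202001),
    ("A", "B", "C", 10, 78, 12, 202002),
    ("A", "B", "D", 23, 64, 87, 202002),
    ("A", "B", "E", 21, 11, 34, 202002),
    ("Z", "K", "C", 10, 78, 12, 202002),
    ("Z", "K", "D", 21, 13, 56, 202002),
    ("Z", "K", "E", 12, 77, 34, 202002),
]
df = pd.DataFrame(rows, columns=[f"column{i}" for i in range(1, 8)])
print(df)
```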
Expected output:
column1 | column2 | column3 | column4_202001 | column5_202001 | column6_202001 | column4_202002 | column5_202002 | column6_202002 |
A | B | C | 10 | 78 | 12 | 10 | 78 | 12 |
A | B | D | 21 | 64 | 87 | 23 | 64 | 87 |
A | B | E | 21 | 64 | 87 | 21 | 11 | 34 |
X | K | C | 54 | 23 | 23 | 0 | 0 | 0 |
X | K | D | 21 | 55 | 87 | 0 | 0 | 0 |
X | K | E | 21 | 43 | 22 | 0 | 0 | 0 |
Z | K | C | 0 | 0 | 0 | 10 | 78 | 12 |
Z | K | D | 0 | 0 | 0 | 21 | 13 | 56 |
Z | K | E | 0 | 0 | 0 | 12 | 77 | 34 |
Also, while transforming, can I create an empty column next to column6_yyyymm for each column7 value?
Final output:
column1 | column2 | column3 | column4_202001 | column5_202001 | column6_202001 | empty_202001 | column4_202002 | column5_202002 | column6_202002 | empty_202002 ....
A | B | C | 10 | 78 | 12 | | 10 | 78 | 12 |
A | B | D | 21 | 64 | 87 | | 23 | 64 | 87 |
A | B | E | 21 | 64 | 87 | | 21 | 11 | 34 |
X | K | C | 54 | 23 | 23 | | 0 | 0 | 0 |
X | K | D | 21 | 55 | 87 | | 0 | 0 | 0 |
X | K | E | 21 | 43 | 22 | | 0 | 0 | 0 |
Z | K | C | 0 | 0 | 0 | | 10 | 78 | 12 |
Z | K | D | 0 | 0 | 0 | | 21 | 13 | 56 |
Z | K | E | 0 | 0 | 0 | | 12 | 77 | 34 |
How can I achieve the final output using a Python function and/or the pandas library? Please let me know if anything is unclear.
Update:
For all empty_yyyymm columns, I want to implement the following function:
def get_final(row):
    # row['column2'] is a scalar here, so use `in` rather than Series.isin
    if row['column2'] in ['H', 'S', 'Z']:
        return 0
    elif row['column4_yyyymm'] + row['column5_yyyymm'] - row['column6_yyyymm'] < 0 and row['column2'] not in ['H', 'S', 'Z']:
        return 0
    else:
        return row['column4_yyyymm'] + row['column5_yyyymm'] - row['column6_yyyymm']
How can I do this as well?
Note: yyyymm is a generic way of referring to the values of column7. It is not an actual column.
Attempt:
df1 = (df.set_index(['column1', 'column2', 'column3', 'column7'])
         .rename_axis(['idx'], axis=1)
         .unstack('column7')
         .reset_index().fillna(0))
df1.columns = df1.columns.map(lambda x: '_'.join([str(i) for i in x]) if (x[1])!='' else x[0])
 | column1 | column2 | column3 | column4_202001 | column4_202002 | column5_202001 | column5_202002 | column6_202001 | column6_202002 |
---|---|---|---|---|---|---|---|---|---|
0 | A | B | C | 10.0 | 10.0 | 78.0 | 78.0 | 12.0 | 12.0 |
1 | A | B | D | 21.0 | 23.0 | 64.0 | 64.0 | 87.0 | 87.0 |
2 | A | B | E | 21.0 | 21.0 | 64.0 | 11.0 | 87.0 | 34.0 |
3 | X | K | C | 54.0 | 0.0 | 23.0 | 0.0 | 23.0 | 0.0 |
4 | X | K | D | 21.0 | 0.0 | 55.0 | 0.0 | 87.0 | 0.0 |
5 | X | K | E | 21.0 | 0.0 | 43.0 | 0.0 | 22.0 | 0.0 |
6 | Z | K | C | 0.0 | 10.0 | 0.0 | 78.0 | 0.0 | 12.0 |
7 | Z | K | D | 0.0 | 21.0 | 0.0 | 13.0 | 0.0 | 56.0 |
8 | Z | K | E | 0.0 | 12.0 | 0.0 | 77.0 | 0.0 | 34.0 |
First create the empty column with DataFrame.assign, then reshape with DataFrame.set_index and DataFrame.unstack, and sort by the second level of the columns (the column7 values) with DataFrame.sort_index:
import numpy as np

df = (df.assign(empty = np.nan)
        .set_index(['column1','column2','column3','column7'])
        .unstack(fill_value=0)
        .sort_index(level=1, axis=1))
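To see what the reshape does to the column order, here is a small sketch on a toy frame. The names `toy`, `key`, `period`, `v1` and `v2` are hypothetical and not part of the question. After unstack the columns are grouped by value column; sorting on level 1 regroups them by period, which is the layout the question asks for.

```python
import pandas as pd

# Hypothetical toy data: two value columns, two periods, one missing combination
toy = pd.DataFrame({
    "key": ["a", "a", "b"],
    "period": [202001, 202002, 202001],
    "v1": [1, 2, 3],
    "v2": [4, 5, 6],
})

wide = toy.set_index(["key", "period"]).unstack(fill_value=0)
before = list(wide.columns)                               # grouped by value column
after = list(wide.sort_index(level=1, axis=1).columns)    # grouped by period

print(before)
print(after)
```

The combination (`b`, 202002) does not exist in the toy data, so `fill_value=0` puts 0 there instead of NaN.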
Then set missing values in all the empty columns, flatten the MultiIndex in the columns with map, and finally convert the index to columns with DataFrame.reset_index:
df['empty'] = np.nan
# to fill with empty strings instead:
#df['empty'] = ''
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
  column1 column2 column3  column4_202001  column5_202001  column6_202001  \
0       A       B       C              10              78              12
1       A       B       D              21              64              87
2       A       B       E              21              64              87
3       X       K       C              54              23              23
4       X       K       D              21              55              87
5       X       K       E              21              43              22
6       Z       K       C               0               0               0
7       Z       K       D               0               0               0
8       Z       K       E               0               0               0

   empty_202001  column4_202002  column5_202002  column6_202002  empty_202002
0           NaN              10              78              12           NaN
1           NaN              23              64              87           NaN
2           NaN              21              11              34           NaN
3           NaN               0               0               0           NaN
4           NaN               0               0               0           NaN
5           NaN               0               0               0           NaN
6           NaN              10              78              12           NaN
7           NaN              21              13              56           NaN
8           NaN              12              77              34           NaN
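The reason `empty` is set to NaN a second time may not be obvious: `unstack(fill_value=0)` also writes 0 into the `empty` cells of combinations that were missing in the long frame. A minimal sketch on a toy frame illustrates this (the names `toy`, `key`, `period` and `v` are hypothetical). Assigning a scalar to the top-level key `empty` overwrites every sub-column under it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; (b, 202002) is missing on purpose
toy = pd.DataFrame({
    "key": ["a", "a", "b"],
    "period": [202001, 202002, 202001],
    "v": [1, 2, 3],
})

wide = (toy.assign(empty=np.nan)
           .set_index(["key", "period"])
           .unstack(fill_value=0)
           .sort_index(level=1, axis=1))

# The missing combination was filled with 0, not NaN
filled = wide.loc["b", ("empty", 202002)]

# Overwrite all sub-columns under the top-level key 'empty'
wide["empty"] = np.nan

# Flatten the two column levels into 'name_period' and restore the index
wide.columns = wide.columns.map(lambda c: f"{c[0]}_{c[1]}")
wide = wide.reset_index()
print(wide)
```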
EDIT: First compute the new empty column from the conditions, then apply the solution above without setting NaN, for example:
m1 = df['column2'].isin(['H', 'S', 'Z'])
s = df['column4'] + df['column5'] - df['column6']
m2 = (s < 0) & ~m1
out = np.where(m1 | m2, 0, s)
df = (df.assign(empty = out)
        .set_index(['column1','column2','column3','column7'])
        .unstack(fill_value=0)
        .sort_index(level=1, axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
  column1 column2 column3  column4_202001  column5_202001  column6_202001  \
0       A       B       C              10              78              12
1       A       B       D              21              64              87
2       A       B       E              21              64              87
3       X       K       C              54              23              23
4       X       K       D              21              55              87
5       X       K       E              21              43              22
6       Z       K       C               0               0               0
7       Z       K       D               0               0               0
8       Z       K       E               0               0               0

   empty_202001  column4_202002  column5_202002  column6_202002  empty_202002
0            76              10              78              12            76
1             0              23              64              87             0
2             0              21              11              34             0
3            54               0               0               0             0
4             0               0               0               0             0
5            42               0               0               0             0
6             0              10              78              12            76
7             0              21              13              56             0
8             0              12              77              34            55
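As a sanity check, the row-wise rule from the question and the vectorized masks above should agree when both are applied to the long frame, before reshaping. The sketch below compares them on a few made-up rows; `long_df`, `BLOCKED` and the row containing `'H'` are assumptions added for illustration, since the sample data has no `'H'`, `'S'` or `'Z'` in column2.

```python
import numpy as np
import pandas as pd

BLOCKED = ["H", "S", "Z"]  # column2 values that always yield 0

def get_final(row):
    """Row-wise version of the rule, applied before reshaping."""
    if row["column2"] in BLOCKED:
        return 0
    total = row["column4"] + row["column5"] - row["column6"]
    return max(total, 0)  # negative totals are clamped to 0

# Hypothetical long-format rows; the 'H' row is made up to exercise the first branch
long_df = pd.DataFrame({
    "column2": ["B", "B", "H", "K"],
    "column4": [10, 21, 50, 54],
    "column5": [78, 64, 50, 23],
    "column6": [12, 87, 10, 23],
})

row_wise = long_df.apply(get_final, axis=1)

# Vectorized form, same masks as in the answer
m1 = long_df["column2"].isin(BLOCKED)
s = long_df["column4"] + long_df["column5"] - long_df["column6"]
m2 = (s < 0) & ~m1
vectorized = np.where(m1 | m2, 0, s)

print(row_wise.tolist())
print(vectorized.tolist())
```

The vectorized form avoids a Python-level call per row, which usually matters on larger frames.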