How to restructure a dataframe to convert column values to row values based on a condition
I have a dataframe with 5 columns. I want to turn 2 of them (Chemo and Surgery) into rows in the Diagnosis series wherever their value is greater than 0, carrying the individual's ID and the age at diagnosis over to each new row.
Here is my dataframe:
import pandas as pd
data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
print(df)
I have tried selecting the rows where Chemo/Surgery is greater than 0, but when I try to add them back as rows, it doesn't work.
This is the end result I want:
ID Diagnosis Age at Diagnosis
0 A-1 Birth 0
1 A-1 Lung cancer 25
2 A-1 Chemo 25
3 A-1 Surgery 25
4 A-1 Death 50
5 A-2 Birth 0
6 A-2 Brain cancer 12
7 A-2 Chemo 12
8 A-2 Skin cancer 20
9 A-2 Chemo 20
10 A-2 Surgery 20
11 A-2 Current age 23
12 A-3 Birth 0
13 A-3 Brain cancer 30
14 A-3 Surgery 30
15 A-3 Lung cancer 33
16 A-3 Chemo 33
17 A-3 Current age 35
This is one of the things I have tried:
chem = "Chemo"
try_df = (df[chem] > 1)
nd = df[try_df]
df["Diagnosis"] = df[chem]
print(df)
We can melt the two columns Chemo and Surgery, drop all the zeros, and concat the result back onto the original dataframe:
# melt the two columns
new_df = df[['ID', 'Chemo', 'Surgery']].melt(id_vars='ID',
                                             value_name='Age at Diagnosis',
                                             var_name='Diagnosis')
# filter out the zeros
new_df = new_df[new_df['Age at Diagnosis'].ne('0')]
# concat with the original dataframe, ignoring the extra columns
new_df = pd.concat((df,new_df), sort=False, join='inner')
# sort values
new_df.sort_values(['ID','Age at Diagnosis'])
Output:
ID Diagnosis Age at Diagnosis
0 A-1 Birth 0
1 A-1 Lung cancer 25
1 A-1 Chemo 25
12 A-1 Surgery 25
2 A-1 Death 50
3 A-2 Birth 0
4 A-2 Brain cancer 12
4 A-2 Chemo 12
5 A-2 Skin cancer 20
5 A-2 Chemo 20
16 A-2 Surgery 20
6 A-2 Current age 23
7 A-3 Birth 0
8 A-3 Brain cancer 30
19 A-3 Surgery 30
9 A-3 Lung cancer 33
9 A-3 Chemo 33
10 A-3 Current age 35
This approach is fairly verbose and takes a few steps. We can't do a simple pivot or index/column stacking because we need to modify one column with partial results from another. This requires splitting and appending.
First, convert the dataframe columns to dtypes we can work with.
data = [['A-1', 'Birth', '0', '0', '0'], ['A-1', 'Lung cancer', '25', '25','25'],['A-1', 'Death', '50', '0','0'],['A-2', 'Birth', '0', '0','0'], ['A-2','Brain cancer', '12', '12','0'],['A-2', 'Skin cancer', '20','20','20'], ['A-2', 'Current age', '23', '0','0'],['A-3', 'Birth','0','0','0'], ['A-3', 'Brain cancer', '30', '0','30'], ['A-3', 'Lung cancer', '33', '33', '0'], ['A-3', 'Current age', '35', '0','0']]
df = pd.DataFrame(data, columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"])
df[["Age at Diagnosis", "Chemo", "Surgery"]] = df[["Age at Diagnosis", "Chemo", "Surgery"]].astype(int)
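To see why the conversion matters, here is a small sketch with hypothetical ages (these values are not from the question's data). Strings sort character by character, so numeric order is only guaranteed once the column holds integers.

```python
import pandas as pd

# Hypothetical ages stored as strings, as in the question's raw data.
ages = pd.Series(["9", "10", "25"])

# String sort compares character by character, so "10" comes before "9".
print(ages.sort_values().tolist())

# After converting to int, the sort is numeric.
print(ages.astype(int).sort_values().tolist())
```

Integer columns also allow numeric comparisons such as `df.Chemo > 0`, which the next step relies on.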
Now we split the dataframe into pieces.
# I like making a copy or resetting an index so that
# pandas is not operating off a slice
df_chemo = df[df.Chemo > 0].copy()
df_surgery = df[df.Surgery > 0].copy()
# drop columns you don't need
df_chemo.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df_surgery.drop(["Chemo", "Surgery"], axis=1, inplace=True)
df.drop(["Chemo", "Surgery"], axis=1, inplace=True)
# Set Chemo and Surgery Diagnosis
df_chemo.Diagnosis = "Chemo"
df_surgery.Diagnosis = "Surgery"
Then append everything together. You can do this because the columns match.
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_new = pd.concat([df, df_chemo, df_surgery])
# make it look pretty
df_new.sort_values(["ID", "Age at Diagnosis"]).reset_index(drop=True)
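The split-and-append steps can also be wrapped in a small function so they work for any list of treatment columns. This is only a sketch: `expand_treatments` is a hypothetical helper name, not a pandas function, and it assumes the treatment columns are already numeric and that the input rows are already in the order you want. It uses `pd.concat`, since `DataFrame.append` was removed in pandas 2.0, and it sorts on the original row labels with a stable sort so each treatment row lands directly after its diagnosis row.

```python
import pandas as pd


def expand_treatments(df, treatment_cols, label_col="Diagnosis"):
    """Return a copy of df with one extra row per positive treatment value.

    Hypothetical helper for illustration; it does not modify df in place.
    """
    base = df.drop(columns=treatment_cols)
    pieces = [base]
    for col in treatment_cols:
        # Rows where this treatment happened, relabelled with the column name.
        extra = base[df[col] > 0].copy()
        extra[label_col] = col
        pieces.append(extra)
    return pd.concat(pieces)


# Patient A-3 from the question, with ages already converted to int.
df = pd.DataFrame(
    [
        ["A-3", "Birth", 0, 0, 0],
        ["A-3", "Brain cancer", 30, 0, 30],
        ["A-3", "Lung cancer", 33, 33, 0],
        ["A-3", "Current age", 35, 0, 0],
    ],
    columns=["ID", "Diagnosis", "Age at Diagnosis", "Chemo", "Surgery"],
)

# mergesort is stable, so rows sharing an index label keep concat order:
# original diagnosis first, then treatments in treatment_cols order.
result = (
    expand_treatments(df, ["Chemo", "Surgery"])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```

Sorting on the index instead of on `["ID", "Age at Diagnosis"]` is a design choice: it keeps the original row order and avoids relying on how ties between equal ages are resolved.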