
What is the best way to create new columns based on value of existing column of Pandas dataframe in Python?
Create new columns for the duplicate records: Python
I have an input file that is generated at runtime in this form. Case 1:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
2,1234567890,A2,150,3
3,0123459876,A3,1000,1
The generated file can also take the following form. Case 2:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
3,0123459876,A3,1000,1
Expected output. Case 1:
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 2.0 A2 150.0 3.0
Case 2:
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 NaN None NaN NaN
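In the input file there may be 0, 1, or 2 rows (but never more than 2) with the same Numbers value (for example 1234567890). I am trying to collapse those rows into one single row, as shown in the expected output.
I want to convert the input file into the structure above. How can I do that? I am really new to pandas. Please help. Thanks in advance.
In case 2:
The structure of the output file must stay the same, i.e. the column names should be the same.
I think you need: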
df['g'] = df.groupby('Numbers').cumcount()
df = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df.columns]
df = df.reset_index()
print (df)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3.0 A3 1000.0 1.0 NaN None NaN NaN
1 1234567890 1.0 A1 200.0 3.0 2.0 A2 150.0 3.0
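The steps above can be reproduced end to end with the sample data from case 1. The following is a minimal sketch, not the answerer's exact code: reading the data through `io.StringIO` and the names `csv_text` and `wide` are assumptions made for illustration, and the column-sorting step is left out so the intermediate results are easier to inspect.

```python
import io
import pandas as pd

# Sample data copied from case 1 of the question.
csv_text = """ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
2,1234567890,A2,150,3
3,0123459876,A3,1000,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# cumcount numbers the rows inside each Numbers group:
# 0 for the first row of a group, 1 for the second.
df['g'] = df.groupby('Numbers').cumcount()
print(df[['Numbers', 'g']])

# Moving g into the column index yields one (name, position) column
# per original column and row position; absent rows become NaN.
wide = df.set_index(['Numbers', 'g']).unstack()

# Flatten the (name, position) pairs into names such as ID_1 or Cores_2.
wide.columns = [f'{name}_{pos + 1}' for name, pos in wide.columns]
wide = wide.reset_index()
print(wide)
```

Note that `read_csv` parses `0123459876` as an integer, which is why the leading zero is gone in the output shown above.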
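Edit:
The values can be converted back to int with a custom function that converts a column only if no error is raised, so columns containing NaN are left unchanged:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x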
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
df1 = df1.apply(f).reset_index()
print (df1)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN None NaN NaN
1 1234567890 1 A1 200 3 2.0 A2 150.0 3.0
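To see why only some columns change type, the conversion function can be tried on individual columns. This is a small sketch under assumptions: the name `to_int_if_possible` and the three sample Series are made up for illustration and mirror a complete numeric column, a numeric column with a missing value, and a text column.

```python
import numpy as np
import pandas as pd

def to_int_if_possible(x):
    """Return x cast to int, or x unchanged when the cast fails."""
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x

complete = pd.Series([1.0, 3.0])     # no missing values, so the cast succeeds
partial = pd.Series([2.0, np.nan])   # NaN cannot be stored in an int column
labels = pd.Series(['A1', 'A3'])     # text cannot be cast to int

# The first Series becomes an integer dtype; the other two are returned unchanged.
print(to_int_if_possible(complete).dtype)
print(to_int_if_possible(partial).dtype)
print(to_int_if_possible(labels).dtype)
```

The cast is all-or-nothing per column, which matches the output above: `ID_1` is shown as integers while `ID_2` stays float because of its NaN.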
EDIT1:
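Each group has 1 or 2 rows, so in case 2 the second set of columns is missing entirely. Reindexing the columns against the full list adds them back, filled with NaN. The original answer used reindex_axis, which was deprecated in pandas 0.21 and removed in 1.0; reindex(columns=...) is the current equivalent:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
cols = ['ID_1','P_ID_1','Cores_1','Count_1','ID_2','P_ID_2','Cores_2','Count_2']
df1 = df1.apply(f).reindex(columns=cols).reset_index()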
print (df1)
Numbers ID_1 P_ID_1 Cores_1 Count_1 ID_2 P_ID_2 Cores_2 Count_2
0 123459876 3 A3 1000 1 NaN NaN NaN NaN
1 1234567890 1 A1 200 3 NaN NaN NaN NaN
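For reuse, the same idea can be wrapped in a function that always returns the fixed set of columns, whichever case the input file falls into. This is a sketch under assumptions: `widen`, `BASE_COLS`, and `MAX_ROWS` are hypothetical names, the data is read through `io.StringIO` for illustration, and the integer conversion step is omitted for brevity. Behaviour on a completely empty input is not covered here.

```python
import io
import pandas as pd

BASE_COLS = ['ID', 'P_ID', 'Cores', 'Count']  # value columns from the question
MAX_ROWS = 2  # the question guarantees at most 2 rows per Numbers value

def widen(df):
    """Hypothetical helper: one row per Numbers value with a fixed column set."""
    df = df.copy()
    df['g'] = df.groupby('Numbers').cumcount()
    wide = df.set_index(['Numbers', 'g']).unstack()
    wide.columns = [f'{name}_{pos + 1}' for name, pos in wide.columns]
    cols = [f'{name}_{i}' for i in range(1, MAX_ROWS + 1) for name in BASE_COLS]
    # reindex adds any missing *_2 columns (filled with NaN)
    # and puts the columns into the order listed in cols.
    return wide.reindex(columns=cols).reset_index()

# Case 2 from the question: no Numbers value is duplicated.
case2 = pd.read_csv(io.StringIO("""ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
3,0123459876,A3,1000,1
"""))
out = widen(case2)
print(out.columns.tolist())
```

Because the column list is built explicitly, the output order does not depend on how the unstacked columns happen to be sorted.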