Splitting a Column on Positive and Negative Values
How do you split a column into two different columns based on a criterion, while keeping one key column? For example:
col1 col2 time value
0 A sdf 16:00:00 100
1 B sdh 17:00:00 -40
2 A sf 18:00:45 300
3 D sfd 20:04:33 -89
I want a new dataframe like this:
time main_val sub_val
0 16:00:00 100 NaN
1 17:00:00 NaN -40
2 18:00:45 300 NaN
3 20:04:33 NaN -89
I use pd.get_dummies, mask, and mul:
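n = {True: 'main_val', False: 'sub_val'}
# dtype=float keeps the indicators numeric on newer pandas, where get_dummies returns booleans by default
m = pd.get_dummies(df.value > 0, dtype=float).rename(columns=n)
# use keywords: newer pandas no longer accepts the axis positionally in drop
df.drop(columns='value').join(m.mask(m == 0).mul(df.value, axis=0))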
col1 col2 time sub_val main_val
0 A sdf 16:00:00 NaN 100.0
1 B sdh 17:00:00 -40.0 NaN
2 A sf 18:00:45 NaN 300.0
3 D sfd 20:04:33 -89.0 NaN
If you look at m.mask(m == 0), it becomes clearer how this works.
sub_val main_val
0 NaN 1.0
1 1.0 NaN
2 NaN 1.0
3 1.0 NaN
pd.get_dummies gives us zeros and ones. Then I turn all the zeros into np.nan. When I multiply with mul, the df.value column gets broadcast across both of these columns and we have our result. I use join to attach it back to the dataframe.
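The steps described above can be sketched end to end with named intermediates. This is only an illustration on the sample data from the question; the names indicators, masked and split are made up here, and dtype=float is an assumption added so the indicators stay numeric on pandas versions where get_dummies returns booleans.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'B', 'A', 'D'],
    'col2': ['sdf', 'sdh', 'sf', 'sfd'],
    'time': ['16:00:00', '17:00:00', '18:00:45', '20:04:33'],
    'value': [100, -40, 300, -89],
})

names = {True: 'main_val', False: 'sub_val'}

# Step 1: one indicator column per outcome of (value > 0).
# dtype=float is an assumption, not part of the original answer.
indicators = pd.get_dummies(df['value'] > 0, dtype=float).rename(columns=names)

# Step 2: turn the zeros into NaN so they stay missing after multiplying.
masked = indicators.mask(indicators == 0)

# Step 3: multiply row-wise; df['value'] is broadcast across both columns.
split = masked.mul(df['value'], axis=0)

# Step 4: attach the two new columns to the rest of the frame.
result = df.drop(columns='value').join(split)
print(result)
```

Each row ends up with exactly one non-missing entry, because 1 * value keeps the value and NaN * value stays NaN.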
We can improve the speed with numpy:
v = df.value.values[:, None]
m = v > 0
n = np.where(np.hstack([m, ~m]), v, np.nan)
c = ['main_val', 'sub_val']
df.drop(columns='value').join(pd.DataFrame(n, df.index, c))
col1 col2 time main_val sub_val
0 A sdf 16:00:00 100.0 NaN
1 B sdh 17:00:00 NaN -40.0
2 A sf 18:00:45 300.0 NaN
3 D sfd 20:04:33 NaN -89.0
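To see why the numpy version works, it may help to look at the array shapes on their own. The following is a small sketch using only the value column from the question; the variable names values, mask and out are illustrative.

```python
import numpy as np

values = np.array([100, -40, 300, -89])

v = values[:, None]          # column vector, shape (4, 1)
m = v > 0                    # boolean column, shape (4, 1)
mask = np.hstack([m, ~m])    # shape (4, 2): [is positive, is not positive]

# np.where broadcasts v from (4, 1) to (4, 2) and keeps each value
# only in the column where the mask is True.
out = np.where(mask, v, np.nan)

print(v.shape, mask.shape, out.shape)
print(out)
```

The first column of out corresponds to main_val and the second to sub_val, which is why the column list is given in that order.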
This can even be done with a pivot table:
df['Val1'] = np.where(df.value >= 0, 'main_val', 'sub_val')
df = pd.pivot_table(df, index='time', values='value',
                    columns=['Val1'], aggfunc='sum').reset_index()
df = pd.DataFrame(df.values)
df.columns = ['time', 'main_val', 'sub_val']
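One thing to keep in mind with the pivot table approach is that time becomes the index, so rows sharing the same time are aggregated into one. A minimal sketch of that behaviour, on made-up data with a duplicated time (the column name kind and the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows share the time 16:00:00.
df = pd.DataFrame({'time': ['16:00:00', '16:00:00', '17:00:00'],
                   'value': [100, 50, -40]})

df['kind'] = np.where(df['value'] >= 0, 'main_val', 'sub_val')

wide = df.pivot_table(index='time', columns='kind',
                      values='value', aggfunc='sum').reset_index()
wide.columns.name = None  # drop the leftover 'kind' label on the columns

print(wide)
```

Here the two 16:00:00 rows are summed into a single main_val entry. If time is unique, as in the question, every row is preserved.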
Use Series.where:
import pandas as pd
df = pd.DataFrame({'col1':['A', 'B', 'A', 'D'],
'col2':['sdf', 'sdh', 'sf', 'sfd'],
'time':['16:00:00', '17:00:00', '18:00:45', '20:04:33'],
'value':[100, -40, 300, -89]})
print(df)
col1 col2 time value
0 A sdf 16:00:00 100
1 B sdh 17:00:00 -40
2 A sf 18:00:45 300
3 D sfd 20:04:33 -89
new = df[['time']].copy()
new['main_val'] = df['value'].where(df['value'] > 0)
new['sub_val'] = df['value'].where(df['value'] < 0)
print(new)
time main_val sub_val
0 16:00:00 100.0 NaN
1 17:00:00 NaN -40.0
2 18:00:45 300.0 NaN
3 20:04:33 NaN -89.0
Use numpy.where when creating the new columns to pick between NaN and the column values (slightly faster than df.where; inspired by the excellent answer from Kamaraju Kusumanchi):
import numpy as np
vals = df.value.values
nans = np.full(len(df), np.nan)
df2 = df[['time']].copy()
df2['main_val'] = np.where(vals < 0, nans, vals)
df2['sub_val'] = np.where(vals >= 0, nans, vals)
print(df2)
time main_val sub_val
0 16:00:00 100.0 NaN
1 17:00:00 NaN -40.0
2 18:00:45 300.0 NaN
3 20:04:33 NaN -89.0
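The where-based and numpy.where-based snippets above agree on the sample data, but they treat a value of exactly zero differently: the strict comparisons (> 0 and < 0) leave a zero missing from both columns, while the < 0 / >= 0 pair puts it in main_val. A small sketch on a made-up series containing a zero shows the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 0, -40])  # hypothetical values, including a zero

# Strict comparisons: zero satisfies neither condition.
strict_main = s.where(s > 0)
strict_sub = s.where(s < 0)

# Complementary comparisons: zero counts as a main value.
vals = s.values
main = np.where(vals < 0, np.nan, vals)
sub = np.where(vals >= 0, np.nan, vals)

print(strict_main.tolist(), strict_sub.tolist())
print(main.tolist(), sub.tolist())
```

Which behaviour is right depends on how zeros should be classified in your data.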