I'm trying to add a new column to a DataFrame that contains only the unique values from an existing column. The new column will hold fewer values, with np.nan in the rows where duplicates would have been.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4,5], 'b':[3,4,3,4,5]})
df
   a  b
0  1  3
1  2  4
2  3  3
3  4  4
4  5  5
Goal:
   a  b    c
0  1  3    3
1  2  4    4
2  3  3  NaN
3  4  4  NaN
4  5  5    5
I've tried:
df['c'] = np.where(df['b'].unique(), df['b'], np.nan)
It raises: ValueError: operands could not be broadcast together with shapes (3,) (5,) ()
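The shapes in the message point at the cause: df['b'].unique() returns only the 3 distinct values, while df['b'] has 5 rows, so np.where cannot line the condition up with the data. A minimal sketch of the mismatch, using the frame from the question (the names uniques and is_repeat are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 4, 3, 4, 5]})

# unique() collapses the column to its distinct values, so the length changes
uniques = df['b'].unique()
print(uniques.shape)    # (3,)
print(df['b'].shape)    # (5,)

# np.where needs one boolean per row; duplicated() returns exactly that
is_repeat = df['b'].duplicated()
print(is_repeat.tolist())  # [False, False, True, True, False]
```

A row-aligned boolean such as is_repeat is what the condition argument needs to be.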
mask + duplicated
You can use the pandas methods Series.mask and Series.duplicated to replace repeated values with NaN:
df['c'] = df['b'].mask(df['b'].duplicated())
print(df)
   a  b    c
0  1  3  3.0
1  2  4  4.0
2  3  3  NaN
3  4  4  NaN
4  5  5  5.0
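Which occurrence survives is controlled by the keep argument of duplicated. A small sketch, assuming the same values as column b above (the variable names are just for illustration):

```python
import pandas as pd

b = pd.Series([3, 4, 3, 4, 5])

# keep='first' (the default): later repeats are flagged and masked
first_kept = b.mask(b.duplicated())

# keep='last': earlier occurrences are flagged instead
last_kept = b.mask(b.duplicated(keep='last'))

# keep=False: every value that occurs more than once is flagged
none_kept = b.mask(b.duplicated(keep=False))

print(pd.DataFrame({'first': first_kept, 'last': last_kept, 'none': none_kept}))
```

With keep=False only the value 5 remains, since it is the only one that never repeats.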
Use duplicated with np.where:
df['c'] = np.where(df['b'].duplicated(), np.nan, df['b'])
Or:
df['c'] = df['b'].where(~df['b'].duplicated(), np.nan)
print(df)
   a  b    c
0  1  3  3.0
1  2  4  4.0
2  3  3  NaN
3  4  4  NaN
4  5  5  5.0
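The 3.0 and 4.0 in the output appear because NaN forces the column to float64. If integer display matters, one option (a sketch, not the only way) is the pandas nullable Int64 dtype, which stores missing values as <NA>:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 4, 3, 4, 5]})

# Cast to the nullable integer dtype first, then mask the repeated values
df['c'] = df['b'].astype('Int64').mask(df['b'].duplicated())

print(df['c'].dtype)
print(df)
```

This assumes a pandas version that supports nullable integer dtypes.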
ppg wrote:
df['c'] = df['b'].mask(df['b'].duplicated())
print(df)
   a  b    c
0  1  3  3.0
1  2  4  4.0
2  3  3  NaN
3  4  4  NaN
4  5  5  5.0
I like the code, but the last row of the new column should also be NaN:
0  1  3  3.0
1  2  4  4.0
2  3  3  NaN
3  4  4  NaN
4  5  5  NaN