I have a pandas DataFrame with many "object" columns, each containing many distinct values (modalities). I want to keep only the 10 most frequent modalities in each column and replace all the others with 'Oth'.
For example, if I have a column 'obj_col1' which contains 4 different values:
obj_col1
'A'
'A'
'B'
'C'
'B'
'D'
and I want to keep the 2 most frequent values, here 'A' and 'B', and replace the rest with 'Oth':
obj_col2
'A'
'A'
'B'
'Oth'
'B'
'Oth'
A piece of code for one object column (categorical variable) is:
# sorted list of modalities of 'categ_var'
list_freq_modal = df['categ_var'].value_counts().index.tolist()
# replace all the modalities except the first 10 by 'Oth'
df['categ_var'].replace(list_freq_modal[10:],'Oth', inplace=True)
But I get an error: 'NoneType' object has no attribute 'any'
Do you have any idea how to implement this in a more efficient way?
Instead of replace, we can use value_counts().head(2) together with where: map the column onto the value counts and build the mask with notnull(). Values outside the top 2 map to NaN, so the mask is False for them, i.e.
x = df['obj_col1'].value_counts().head(2)
#B 2
#A 2
#Name: obj_col1, dtype: int64
df['obj_col1'].where(df['obj_col1'].map(x).notnull(),'Oth')
Output :
0      A
1      A
2      B
3    Oth
4      B
5    Oth
Name: obj_col1, dtype: object
df['obj_col1'].map(x).notnull() # This will give the mask.
0     True
1     True
2     True
3    False
4     True
5    False
Name: obj_col1, dtype: bool
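To apply the same idea to every categorical column at once, something along these lines may work. This is only a sketch: the helper name keep_top_modalities and its parameters (n, other, columns) are made up for illustration, it uses isin() on the top values instead of map().notnull(), and by default it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

def keep_top_modalities(df, n=10, other="Oth", columns=None):
    """Hypothetical helper: return a copy of df in which each selected
    column keeps only its n most frequent values; the rest become `other`."""
    out = df.copy()
    if columns is None:
        # Assumption: all non-numeric columns are categorical variables.
        columns = out.select_dtypes(exclude="number").columns
    for col in columns:
        # Index of the n most frequent values (value_counts sorts descending).
        top = out[col].value_counts().head(n).index
        # Keep values that are in the top n, replace everything else.
        # Note: NaN is not counted by value_counts, so it also becomes `other`.
        out[col] = out[col].where(out[col].isin(top), other)
    return out

df = pd.DataFrame({
    "obj_col1": ["A", "A", "B", "C", "B", "D"],
    "num_col": [1, 2, 3, 4, 5, 6],
})
result = keep_top_modalities(df, n=2)
print(result)
```

Working on a copy and assigning the result back to the column avoids the inplace=True call on a selected column. If missing values should stay missing, the mask can be extended with `| out[col].isna()`.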