I have a DataFrame with millions of rows and a lot of NaN values. For example:
index Company Area
0 Google Technology
1 Coca Cola Drinks
2 NaN Drinks
3 Apple Technology
4 NaN Technology
5 Gatorade Drinks
6 Dell Technology
7 Apple Technology
8 Coca Cola Drinks
9 NaN Drinks
10 Google Technology
My idea is to fill the NaN values in Company with one of the 2 most common values for its Area.
For example: if the most frequent companies in the Technology area are Apple and Google, I would like to fill the NaN values where df['Area'] == 'Technology' with one of those values, chosen at random.
I've already created a grouped DataFrame with the most common values. It looks something like this:
Area Company
Technology Google
Technology Apple
Drinks Coca Cola
Drinks Pepsi
The result should be something like this:
index Company Area
0 Google Technology
1 Coca Cola Drinks
2 Pepsi Drinks
3 Apple Technology
4 Google Technology
5 Gatorade Drinks
6 Dell Technology
7 Apple Technology
8 Coca Cola Drinks
9 Pepsi Drinks
10 Google Technology
I hope you can help me.
Thanks!!!
I came up with this solution using random.choice. Here df1 is the grouped DataFrame of most common values per Area:
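import random
# Collect the candidate companies per Area as lists, align them to df's rows by Area, then pick one at random for each row
s = df1.groupby('Area').Company.apply(list).reindex(df.Area).apply(lambda x: random.choice(x))
# Restore df's index so that fillna aligns row by row
s.index = df.index
# Only the NaN entries in Company are replaced
df.Company = df.Company.fillna(s)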
df
Out[200]:
index Company Area
0 0 Google Technology
1 1 CocaCola Drinks
2 2 CocaCola Drinks
3 3 Apple Technology
4 4 Google Technology
5 5 Gatorade Drinks
6 6 Dell Technology
7 7 Apple Technology
8 8 CocaCola Drinks
9 9 Pepsi Drinks
10 10 Google Technology
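Since the question mentions millions of rows, a variant that draws random values only for the missing rows, one batch per Area, may be worth trying. The following is a minimal sketch under assumptions, not a benchmarked replacement: the sample data is copied from the question, df1 is rebuilt by hand from the table shown there, and the names rng, candidates and filled are made up for illustration. It uses NumPy's Generator.choice instead of random.choice.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Company": ["Google", "Coca Cola", np.nan, "Apple", np.nan, "Gatorade",
                "Dell", "Apple", "Coca Cola", np.nan, "Google"],
    "Area": ["Technology", "Drinks", "Drinks", "Technology", "Technology",
             "Drinks", "Technology", "Technology", "Drinks", "Drinks",
             "Technology"],
})

# Lookup table of the most common companies per Area (typed in by hand here).
df1 = pd.DataFrame({
    "Area": ["Technology", "Technology", "Drinks", "Drinks"],
    "Company": ["Google", "Apple", "Coca Cola", "Pepsi"],
})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# One list of candidate companies per Area.
candidates = df1.groupby("Area")["Company"].apply(list)

filled = df.copy()
missing = filled["Company"].isna()

for area, options in candidates.items():
    # Rows of this Area whose Company is missing.
    mask = missing & (filled["Area"] == area)
    if mask.any():
        # Draw one candidate per missing row in a single call.
        filled.loc[mask, "Company"] = rng.choice(options, size=int(mask.sum()))

print(filled)
```

The loop runs once per Area rather than once per row, so the number of Python-level calls depends on how many areas there are, not on the number of rows. Rows whose Area has no entry in df1 are left as NaN.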