I have a dataframe with a non-unique column col1, like the following:
  col1  col2
0    a     1
1    a     1
2    a     2
3    b     3
4    b     3
5    c     2
6    c     2
Some values of col1 repeat many times and others only a few. I'd like to take the least frequent values (the bottom 80%, 50% or 10%) and change them to 'others' ahead of plotting.
I've got a series which contains the codes in col1 (as the index) and the number of times each appears in the df, in descending order, by doing the following:
df2 = df.groupby(['col1']).size().sort_values(ascending=False)
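As an aside, the same counts can likely be obtained more directly with `Series.value_counts`, which sorts in descending order by default. A minimal sketch, assuming the sample frame shown above (the name `counts` is only illustrative):

```python
import pandas as pd

# Sample frame matching the one in the question.
df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "c", "c"],
                   "col2": [1, 1, 2, 3, 3, 2, 2]})

# value_counts() counts each distinct value and sorts descending by default.
counts = df["col1"].value_counts()

# The groupby version from the question, for comparison.
df2 = df.groupby(["col1"]).size().sort_values(ascending=False)

print(counts)
```

Both give the same count per value; the relative order of tied values (here 'b' and 'c') should not be relied on.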
I've also got my cut-off point (everything after it is the bottom 80%):
cutOff = round(len(df2)/5)
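One thing worth checking with this cut-off: Python's built-in `round` rounds exact halves to the nearest even integer, so `round(0.5)` is 0, and with very few distinct values the cut-off can come out as 0. A small sketch of a guard, assuming at least one value should always be kept (the helper name `top_count` and its `keep_fraction` parameter are hypothetical, not part of the original code):

```python
def top_count(n_categories: int, keep_fraction: float = 0.2) -> int:
    """Number of most frequent categories to keep (hypothetical helper)."""
    # round() uses round-half-to-even and can return 0, so clamp to at least 1.
    return max(1, round(n_categories * keep_fraction))


print(top_count(3))   # 1
print(top_count(2))   # 1 (round(0.4) is 0, clamped to 1)
print(top_count(10))  # 2
```

With the three distinct values in the sample data this gives 1, so only the single most frequent value would be kept.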
I'd like to update col1 in df with the value 'others' when col1 appears after the cutOff in the index of the series df2.
I don't know how to go about checking and updating. I figured that the best way would be to do a groupby on col1 and then loop through, but it starts to fall apart. Should I create a new groupby object? Or do I call this as an .apply() for each row? Can you update a column that is being used as the index for a dataframe? I could do with some help on how to start.
Edit to add:
So if the 'b's in col1 were not among the top 20% most frequent values in col1, then I'd expect to see:
     col1  col2
0       a     1
1       a     1
2       a     2
3  others     3
4  others     3
5       c     2
6       c     2
import pandas as pd

# Sample data from the question.
data = [["a", 1],
        ["a", 1],
        ["a", 2],
        ["b", 3],
        ["b", 3],
        ["c", 2],
        ["c", 2]]
df = pd.DataFrame(data, columns=["col1", "col2"])
print(df)
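# Count how often each col1 value occurs, most frequent first.
df2 = df.groupby(['col1']).size().sort_values(ascending=False)
print(df2)

# Number of distinct values that make up the top 20%.
cutOff = round(len(df2) / 5)

# cutOff is a count, so the slice starts at cutOff
# (cutOff + 1 would keep one value too many).
others = df2.iloc[cutOff:]
print(others)

# Work on a copy and overwrite col1 wherever its value is in `others`.
result = df.copy()
result.loc[result["col1"].isin(others.index), "col1"] = "others"
print(result)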
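Since the question mentions trying several thresholds (80%/50%/10%), the same idea could be wrapped in a function. This is only a sketch: the name `lump_rare` and its parameters `keep_fraction` and `label` are hypothetical, and it assumes that at least one value should always be kept. It uses `Series.where`, which keeps values where the condition is True and replaces the rest.

```python
import pandas as pd


def lump_rare(df, column, keep_fraction=0.2, label="others"):
    """Return a copy of df with infrequent values of `column` replaced by `label`.

    Hypothetical helper: keeps the top `keep_fraction` of distinct values
    by frequency (at least one) and relabels everything else.
    """
    counts = df[column].value_counts()  # sorted descending by default
    n_keep = max(1, round(len(counts) * keep_fraction))
    keep = counts.index[:n_keep]
    result = df.copy()
    # where() keeps entries whose value is in `keep`, replaces the others.
    result[column] = result[column].where(result[column].isin(keep), label)
    return result


df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "c", "c"],
                   "col2": [1, 1, 2, 3, 3, 2, 2]})
result = lump_rare(df, "col1", keep_fraction=0.2)
print(result)
```

With the sample data only 'a' is kept, and every 'b' and 'c' row becomes 'others'. No loop or `.apply()` is needed, and the original `df` is left unchanged. When several values are tied at the boundary, which of them is kept depends on the sort order of the counts.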