python/pandas: update a column based on a series holding sums of that same column

Question

I have a dataframe with a non-unique col1, like the following:

    col1    col2
0      a      1
1      a      1
2      a      2
3      b      3
4      b      3
5      c      2
6      c      2

Some values of col1 repeat many times and others only a few. I'd like to take the least frequent values (the bottom 80%/50%/10%) and change them to 'others' ahead of plotting.

I've got a series which contains the codes in col1 (as the index) and the number of times each appears in the df, in descending order, built as follows:

df2 = df.groupby(['col1']).size().sort_values(ascending=False)
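For reference, a minimal sketch of what that series looks like on the sample data above. The variable names `counts` and `counts_vc` are only illustrative; `value_counts()` is shown as an alternative that should give the same counts, although the order of tied values may differ between the two.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"col1": list("aaabbcc"), "col2": [1, 1, 2, 3, 3, 2, 2]})

# Occurrences of each col1 value, most frequent first
counts = df.groupby("col1").size().sort_values(ascending=False)

# Same counts via value_counts(), which sorts in descending order by default
counts_vc = df["col1"].value_counts()

print(counts)
print(counts_vc)
```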

I've also got my cut-off point (for lumping the bottom 80%, i.e. keeping the top fifth of the distinct values):

cutOff = round(len(df2)/5)
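Since the aim is to try several thresholds (80%/50%/10%), the cut-off could be computed from the fraction rather than hard-coding the division by 5. This is only a sketch: `n_to_keep` is a hypothetical helper, and keeping at least one category is an assumption that is not in the original formula.

```python
def n_to_keep(n_categories, bottom_fraction):
    """Number of top categories left untouched when the bottom
    `bottom_fraction` of distinct values is lumped together.

    Hypothetical helper; the max(1, ...) floor is an assumption so that
    at least one category always survives.
    """
    return max(1, round(n_categories * (1 - bottom_fraction)))


# With the 3 distinct values of the sample data
for fraction in (0.8, 0.5, 0.1):
    print(fraction, n_to_keep(3, fraction))
```

Note that Python's `round()` rounds halves to the nearest even integer, so small category counts can give slightly surprising cut-offs.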

I'd like to update col1 in df with the value 'others' whenever its value appears after the cutOff position in the index of the series df2.

I don't know how to go about checking and updating. I figured the best way would be to do a groupby on col1 and then loop through the groups, but that starts to fall apart. Should I create a new groupby object? Or should I call this as an .apply() for each row? Can you update a column that is being used as the index for a dataframe? I could do with some help on how to start.

edit to add:

So if the 'b's in col1 were not in the top 20% most populous values in col1 then I'd expect to see:

     col1    col2
0       a      1
1       a      1
2       a      2
3  others      3
4  others      3
5       c      2
6       c      2

Answer 1

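import pandas as pd

# Sample data matching the question
data = [["a", 1],
        ["a", 1],
        ["a", 2],
        ["b", 3],
        ["b", 3],
        ["c", 2],
        ["c", 2]]
df = pd.DataFrame(data, columns=["col1", "col2"])
print(df)

# Count how often each col1 value occurs, most frequent first
df2 = df.groupby(["col1"]).size().sort_values(ascending=False)
print(df2)

# Cut-off position for the top 20% (here round(3 / 5) == 1)
cutOff = round(len(df2) / 5)
# Values strictly after position cutOff, so the first cutOff + 1 entries are kept;
# use df2.iloc[cutOff:] instead to keep exactly cutOff categories
others = df2.iloc[cutOff + 1:]
print(others)

# Replace the rare values on a copy so that df itself stays unchanged
result = df.copy()
result.loc[result["col1"].isin(others.index), "col1"] = "others"
print(result)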
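
The same idea can be wrapped in a small reusable function, sketched here under a few assumptions: `lump_rare` and its parameters are hypothetical names, it keeps exactly the `keep` most frequent values (unlike the `cutOff + 1` slice above), and it uses `Series.where` instead of `.loc` assignment. No loop, extra groupby object or row-wise `.apply()` is needed.

```python
import pandas as pd


def lump_rare(df, column, keep, label="others"):
    """Return a copy of df where every value of `column` outside the
    `keep` most frequent ones is replaced by `label`.

    Hypothetical helper; which of several tied values is kept depends on
    the order returned by value_counts().
    """
    counts = df[column].value_counts()      # descending by frequency
    top = counts.index[:keep]               # the values to leave untouched
    out = df.copy()
    # where() keeps entries matching the mask and replaces the rest
    out[column] = out[column].where(out[column].isin(top), label)
    return out


df = pd.DataFrame({"col1": list("aaabbcc"), "col2": [1, 1, 2, 3, 3, 2, 2]})
result = lump_rare(df, "col1", keep=1)
print(result)
```

Because the function works on a copy, the original df is left as it was, which makes it easy to try several thresholds before plotting.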
