Remove low frequency items from pandas dataframe

Question

I am playing around with the Last.fm dataset. It consists of a user id, an artist name, and a number of plays, something like this:

    user                                        artist                  plays
0   00000c289a1829a808ac09c00daf10bc3c4e223b    betty blowtorch         2137
1   00000c289a1829a808ac09c00daf10bc3c4e223b    die Ärzte               1099
2   00000c289a1829a808ac09c00daf10bc3c4e223b    melissa etheridge       897
3   00000c289a1829a808ac09c00daf10bc3c4e223b    elvenking               717
4   00000c289a1829a808ac09c00daf10bc3c4e223b    juliette & the licks    706

Now I want to clean this data up a bit. Since many of the artist names are incorrect, I want to remove artists that have been played fewer than, say, 50 times across all users.

I guess I should use groupby and aggregate the plays per artist. But since I am fairly new to pandas and my dataset is very big, I would like to know the best practice for removing these items.

tl;dr:
What is the best way to remove the least-played artists?

PS (edit):
The desired output is a dataframe with the same schema as the input, without the artists whose total plays (summed over all users) fall below a certain number.

PS2: For example, I have this dataset:

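import pandas as pd

df = pd.DataFrame({
    # each user repeated three times: a, a, a, b, b, b, c, c, c
    'user': [u for u in 'abc' for _ in range(3)],
    'artist': 3 * ('metallica', 'coldplay', 'dfj'),
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2]
})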

So we have this dataframe:

    user    artist      plays
0   a       metallica   100
1   a       coldplay     24
2   a       dfj           3
3   b       metallica    48
4   b       coldplay    135
5   b       dfj          10
6   c       metallica    62
7   c       coldplay     38
8   c       dfj           2

Now "dfj" has been played only 15 times overall. I want to remove "dfj" and return something like this:

    user    artist      plays
0   a       metallica   100
1   a       coldplay     24
3   b       metallica    48
4   b       coldplay    135
6   c       metallica    62
7   c       coldplay     38

Answer 1

I believe you need boolean indexing with GroupBy.transform, which returns a Series of aggregated values with the same size as the original DataFrame:

print(df.groupby('artist')['plays'].transform('sum'))
0    210
1    197
2     15
3    210
4    197
5     15
6    210
7    197
8     15
Name: plays, dtype: int64

df1 = df[df.groupby('artist')['plays'].transform('sum') > 50]
print(df1)
   user     artist  plays
0     a  metallica    100
1     a   coldplay     24
3     b  metallica     48
4     b   coldplay    135
6     c  metallica     62
7     c   coldplay     38
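
For comparison, here is a minimal sketch of two other ways to express the same per-artist filter, assuming the nine-row example frame from the question (users a, b, c; "dfj" played 3, 10 and 2 times). The helper names and the `min_plays` parameter are hypothetical, introduced only for illustration. `GroupBy.filter` reads naturally but calls a Python function once per group, so on a frame with very many artists the `transform` or `isin` variants are likely to be faster.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "artist": 3 * ["metallica", "coldplay", "dfj"],
    "plays": [100, 24, 3, 48, 135, 10, 62, 38, 2],
})


def drop_rare_artists_filter(frame, min_plays=50):
    # Hypothetical helper: keep every row of a group whose summed plays
    # exceed min_plays; rows keep their original order and index.
    return frame.groupby("artist").filter(lambda g: g["plays"].sum() > min_plays)


def drop_rare_artists_isin(frame, min_plays=50):
    # Hypothetical helper: compute the per-artist totals once, then keep
    # the rows whose artist is among the surviving labels.
    totals = frame.groupby("artist")["plays"].sum()
    keep = totals[totals > min_plays].index
    return frame[frame["artist"].isin(keep)]


via_filter = drop_rare_artists_filter(df)
via_isin = drop_rare_artists_isin(df)
```

Both results should contain the same six rows as df1 above; the threshold is a parameter so it can be raised for the full dataset.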

Answer 2

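The simplest thing to try, based on my understanding of the post, is a plain boolean filter on the plays column. Note that this compares each row's own play count with the threshold; it does not sum plays per artist across users.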

>>> df
                                       user                artist  plays
0  00000c289a1829a808ac09c00daf10bc3c4e223b       betty blowtorch   2137
1  00000c289a1829a808ac09c00daf10bc3c4e223b             die Ärzte   1099
2  00000c289a1829a808ac09c00daf10bc3c4e223b     melissa etheridge    897
3  00000c289a1829a808ac09c00daf10bc3c4e223b             elvenking    717
4  00000c289a1829a808ac09c00daf10bc3c4e223b  juliette & the licks    706

Result:

>>> df[df['plays'] > 897]
                                       user           artist  plays
0  00000c289a1829a808ac09c00daf10bc3c4e223b  betty blowtorch   2137
1  00000c289a1829a808ac09c00daf10bc3c4e223b        die Ärzte   1099
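
To make the difference between a row-level threshold and a per-artist total concrete, here is a small sketch on the question's nine-row example (assumed here with users a, b, c). The variable names are illustrative only. With a threshold of 50, the row-level filter discards individual low rows such as coldplay's 24 plays by user a, even though coldplay totals 197 plays, whereas the per-artist filter removes only "dfj".

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "artist": 3 * ["metallica", "coldplay", "dfj"],
    "plays": [100, 24, 3, 48, 135, 10, 62, 38, 2],
})

# Row-level filter: each row is judged on its own play count.
row_level = df[df["plays"] > 50]

# Per-artist filter: each row is judged on its artist's total plays.
per_artist = df[df.groupby("artist")["plays"].transform("sum") > 50]
```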

2 answers

solution1
2 ACCEPTED 2018-11-03 12:06:07

solution2
0 2018-11-03 12:06:10
