I am playing around with the Last.fm dataset. It consists of a user id, an artist name, and a number of plays, something like this:
user artist plays
0 00000c289a1829a808ac09c00daf10bc3c4e223b betty blowtorch 2137
1 00000c289a1829a808ac09c00daf10bc3c4e223b die Ärzte 1099
2 00000c289a1829a808ac09c00daf10bc3c4e223b melissa etheridge 897
3 00000c289a1829a808ac09c00daf10bc3c4e223b elvenking 717
4 00000c289a1829a808ac09c00daf10bc3c4e223b juliette & the licks 706
Now I want to clean this data up a bit. Since many of the artist names are incorrect, I want to remove artists that have been played fewer than, say, 50 times across all users.
I guess I should use groupby and count them, but since I am fairly new to pandas and my dataset is very big, I wanted to know what the best practice is for removing these items.
tl;dr:
What is the best way to remove the lowest-occurring artists?
PS (edit):
The desired output is a dataframe with the same schema as the input, without the artists whose plays (summed over all users) fall below a certain number.
PS2: For example, I have this dataset:
import pandas as pd

df = pd.DataFrame({
    'user': list('aaabbbccc'),  # three rows each for users a, b, c
    'artist': 3 * ('metallica', 'coldplay', 'dfj'),
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2]
})
So we have this dataframe:
user artist plays
0 a metallica 100
1 a coldplay 24
2 a dfj 3
3 b metallica 48
4 b coldplay 135
5 b dfj 10
6 c metallica 62
7 c coldplay 38
8 c dfj 2
Now "dfj" has been played only 15 times overall. I want to remove "dfj" and get something like this:
user artist plays
0 a metallica 100
1 a coldplay 24
3 b metallica 48
4 b coldplay 135
6 c metallica 62
7 c coldplay 38
I believe you need boolean indexing with GroupBy.transform, which returns a Series of aggregated values with the same size as the original DataFrame:
print (df.groupby('artist')['plays'].transform('sum'))
0 210
1 197
2 15
3 210
4 197
5 15
6 210
7 197
8 15
Name: plays, dtype: int64
df1 = df[df.groupby('artist')['plays'].transform('sum') > 50]
print (df1)
user artist plays
0 a metallica 100
1 a coldplay 24
3 b metallica 48
4 b coldplay 135
6 c metallica 62
7 c coldplay 38
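An equivalent way to write the same filter, as a rough sketch: sum the plays once per artist, pick the artists whose total reaches the threshold, and select their rows with isin. It assumes the columns are named artist and plays as in the example. The helper name drop_rare_artists and its min_plays parameter are made up here for illustration, and the data is copied from the table shown in the question.

```python
import pandas as pd

def drop_rare_artists(df, min_plays=50):
    """Keep only rows whose artist has at least `min_plays` plays in total."""
    # One total per artist (index = artist name, value = summed plays)
    totals = df.groupby('artist')['plays'].sum()
    # Artists whose total reaches the threshold
    keep = totals.index[totals >= min_plays]
    # Boolean mask over the original rows; index and columns are preserved
    return df[df['artist'].isin(keep)]

# Example data taken from the table in the question
df = pd.DataFrame({
    'user': list('aaabbbccc'),
    'artist': 3 * ['metallica', 'coldplay', 'dfj'],
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2],
})

result = drop_rare_artists(df, min_plays=50)
print(result)
```

This uses >= so that an artist with exactly min_plays plays is kept, which matches "fewer than 50" in the question; switch to > if you prefer the strict comparison used above.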
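The simplest thing to try, based on my understanding of the post, is to filter on the plays column directly. Note that this compares each row's own play count, not the artist's total across all users.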
>>> df
user artist plays
0 00000c289a1829a808ac09c00daf10bc3c4e223b betty blowtorch 2137
1 00000c289a1829a808ac09c00daf10bc3c4e223b die Ärzte 1099
2 00000c289a1829a808ac09c00daf10bc3c4e223b melissa etheridge 897
3 00000c289a1829a808ac09c00daf10bc3c4e223b elvenking 717
4 00000c289a1829a808ac09c00daf10bc3c4e223b juliette & the licks 706
Result:
>>> df[(df['plays'] >897)]
user artist plays
0 00000c289a1829a808ac09c00daf10bc3c4e223b betty blowtorch 2137
1 00000c289a1829a808ac09c00daf10bc3c4e223b die Ärzte 1099
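To make the difference between the two kinds of filter concrete, here is a small comparison on the nine-row example from the question. It is only a sketch: the data is copied from the question's table, and the variable names row_filter and artist_filter are my own.

```python
import pandas as pd

# Example data taken from the table in the question
df = pd.DataFrame({
    'user': list('aaabbbccc'),
    'artist': 3 * ['metallica', 'coldplay', 'dfj'],
    'plays': [100, 24, 3, 48, 135, 10, 62, 38, 2],
})

# Per-row filter: keeps a row only if that single user/artist pair exceeds 50 plays
row_filter = df[df['plays'] > 50]

# Per-artist filter: keeps every row of an artist whose summed plays exceed 50
artist_filter = df[df.groupby('artist')['plays'].transform('sum') > 50]

print(row_filter.index.tolist())     # [0, 4, 6]
print(artist_filter.index.tolist())  # [0, 1, 3, 4, 6, 7]
```

The per-row filter drops rows such as (b, metallica, 48) even though metallica has 210 plays overall, so it only fits if the threshold is meant per user rather than per artist.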