I have been running the code for like 45 mins now and is still going. Can someone please suggest to me how I can make it faster?
df4 is a panda data frame. df4.head() looks like this
df4 = pd.DataFrame({
'hashtag':np.random.randn(3000000),
'sentiment_score':np.random.choice( [0,1], 3000000),
'user_id':np.random.choice( ['11','12','13'], 3000000),
})
What I am aiming to have is a new column called rating.
len(df4.index) is 3,037,321.
ratings = []
for index in df4.index:
rowUserID = df4['user_id'][index]
rowTrackID = df4['track_id'][index]
rowSentimentScore = df4['sentiment_score'][index]
condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
allRows = df4[condition]
totalSongListendForContext = len(allRows.index)
rows = df4[(condition & (df4['track_id'] == rowTrackID))]
songListendForContext = len(rows.index)
rating = songListendForContext/totalSongListendForContext
ratings.append(rating)
Globally, you'll need groupby
. you can either:
use two groupby
with transform
to get the size
of what you called condition
and the size
of the condition & (df4['track_id'] == rowTrackID)
, divide the second by the first:
df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
/ df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
Or use groupby
with value_counts
with the parameter normalize=True
and merge
the result with df4:
df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
.value_counts(normalize=True)
.rename('ratings').reset_index(),
how='left')
in both case, you will get the same result as your list ratings
(that I assume you wanted to be a column). I would say the second option is faster but it depends on the number of groups you have in your real case.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.