Looping over rows in Pandas dataframe taking too long

Question

I have been running the code for like 45 mins now and is still going. Can someone please suggest to me how I can make it faster?

df4 is a panda data frame. df4.head() looks like this

df4 = pd.DataFrame({ 
    'hashtag':np.random.randn(3000000),
    'sentiment_score':np.random.choice( [0,1], 3000000),
    'user_id':np.random.choice( ['11','12','13'], 3000000),
    })

What I am aiming to have is a new column called rating.

len(df4.index) is 3,037,321.

ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]

    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)

    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)

    rating = songListendForContext/totalSongListendForContext
    ratings.append(rating)

Answer 1

Globally, you'll need groupby . you can either:

use two groupby with transform to get the size of what you called condition and the size of the condition & (df4['track_id'] == rowTrackID) , divide the second by the first:

df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
                   / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))

Or use groupby with value_counts with the parameter normalize=True and merge the result with df4:

df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')

in both case, you will get the same result as your list ratings (that I assume you wanted to be a column). I would say the second option is faster but it depends on the number of groups you have in your real case.

Looping over rows in Pandas dataframe taking too long

Question

1 answers

solution1
1 2020-03-13 00:13:17

Looping over rows in Pandas dataframe taking too long

Question

1 answers

solution1 1 2020-03-13 00:13:17

solution1
1 2020-03-13 00:13:17