简体   繁体   中英

Looping over rows in Pandas dataframe taking too long

I have been running the code for like 45 mins now and is still going. Can someone please suggest to me how I can make it faster?

df4 is a panda data frame. df4.head() looks like this

df4 = pd.DataFrame({ 
    'hashtag':np.random.randn(3000000),
    'sentiment_score':np.random.choice( [0,1], 3000000),
    'user_id':np.random.choice( ['11','12','13'], 3000000),
    })

What I am aiming to have is a new column called rating.

len(df4.index) is 3,037,321.

ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]

    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)

    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)

    rating = songListendForContext/totalSongListendForContext
    ratings.append(rating)

Globally, you'll need groupby . you can either:

use two groupby with transform to get the size of what you called condition and the size of the condition & (df4['track_id'] == rowTrackID) , divide the second by the first:

df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
                   / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))

Or use groupby with value_counts with the parameter normalize=True and merge the result with df4:

df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')

in both case, you will get the same result as your list ratings (that I assume you wanted to be a column). I would say the second option is faster but it depends on the number of groups you have in your real case.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM