Pandas Weighted Mean

I am trying to process some Twitter sentiment data, and I am using pandas to do it. So far I have calculated the mean sentiment of all the tweets for each date. I would also like to use a score to create a weighted mean per day: if the score for a tweet is 2, it should influence the daily result as much as 2 tweets would.
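
To make the requested weighting concrete, here is a small sketch with made-up numbers (not taken from the tweet data): two tweets with sentiments 1 and 5 and scores 2 and 1 should average as if the first tweet appeared twice. It uses `numpy.average`, whose `weights` argument normalises by the sum of the weights.

```python
import numpy as np

sentiments = np.array([1, 5])   # illustrative values only
scores = np.array([2, 1])       # the first tweet should count twice

# Weighted mean: (2*1 + 1*5) / (2 + 1)
weighted_mean = np.average(sentiments, weights=scores)

# Same result as literally repeating the first tweet
repeated_mean = np.mean([1, 1, 5])

print(weighted_mean, repeated_mean)
```

Both values agree, which is the behaviour the question asks for.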

import datetime

tweets = [
    {'tweet_user_verified': 1, 'tweet_user_id': 14631115, 'tweet_favorite_count': 17048, 'tweet_sentiment': 1, 'tweet_retweet_count': 4842, 'tweet_id': 698155842877702144, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 44, 58), 'tweet_lang': 'en', 'tweet_text': 'they fixed my iphone but now the z and the s do nt work ! they re like two of the best letters'},
    {'tweet_user_verified': 0, 'tweet_user_id': 73518190, 'tweet_favorite_count': 1, 'tweet_sentiment': 1, 'tweet_retweet_count': 0, 'tweet_id': 698179827900125185, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 20, 17), 'tweet_lang': 'en', 'tweet_text': 'ihs is nt ghetto , it s full of suburban kids who think their ghetto cause their parents ca nt afford to buy them a new iphone .'},
    {'tweet_user_verified': 0, 'tweet_user_id': 1832492197, 'tweet_favorite_count': 2, 'tweet_sentiment': 5, 'tweet_retweet_count': 0, 'tweet_id': 698179203376623616, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 17, 48), 'tweet_lang': 'en', 'tweet_text': 'how to brick iphone 5s and above 1. set the date and time to january 1st , 1970 on the device 2. restart 3. profit'},
    {'tweet_user_verified': 0, 'tweet_user_id': 70582223, 'tweet_favorite_count': 18, 'tweet_sentiment': 5, 'tweet_retweet_count': 14, 'tweet_id': 698178066292539392, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 13, 17), 'tweet_lang': 'en', 'tweet_text': 'iphone battery s go from 100-75 in seconds'},
    {'tweet_user_verified': 0, 'tweet_user_id': 31050061, 'tweet_favorite_count': 72, 'tweet_sentiment': 1, 'tweet_retweet_count': 40, 'tweet_id': 698176382417903618, 'tweet_date': datetime.datetime(2016, 2, 12, 17, 6, 35), 'tweet_lang': 'en', 'tweet_text': 'a @ tmobile iphone ad featuring a woman wearing hijab is up on the # nyc subway platform walls pic.twitter.com/lsb0dzfymd'},
    {'tweet_user_verified': 0, 'tweet_user_id': 733813, 'tweet_favorite_count': 14, 'tweet_sentiment': 1, 'tweet_retweet_count': 2, 'tweet_id': 698170656203149312, 'tweet_date': datetime.datetime(2016, 2, 12, 16, 43, 50), 'tweet_lang': 'en', 'tweet_text': 'if a modern 4″ iphone does arrive in march i might go for it . would miss 5″ screen but reaching the top left is a constant micro-irritant .'},
    {'tweet_user_verified': 0, 'tweet_user_id': 3098026668, 'tweet_favorite_count': 11, 'tweet_sentiment': 1, 'tweet_retweet_count': 13, 'tweet_id': 698170562250739713, 'tweet_date': datetime.datetime(2016, 2, 12, 16, 43, 28), 'tweet_lang': 'en', 'tweet_text': 'hidden iphone 6s easter egg . pic.twitter.com/op1kqewwqv'},
    {'tweet_user_verified': 1, 'tweet_user_id': 1769191, 'tweet_favorite_count': 11, 'tweet_sentiment': 1, 'tweet_retweet_count': 5, 'tweet_id': 698163741158838272, 'tweet_date': datetime.datetime(2016, 2, 12, 16, 16, 21), 'tweet_lang': 'en', 'tweet_text': 'that it took until iphone 7 for this to happen just shows you how hard it is to find a fabricator on par with samsung'},
    {'tweet_user_verified': 0, 'tweet_user_id': 64334539, 'tweet_favorite_count': 8, 'tweet_sentiment': 1, 'tweet_retweet_count': 20, 'tweet_id': 698154074160697344, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 37, 57), 'tweet_lang': 'en', 'tweet_text': 'rt @ iphoneteam : when your iphone corrects omw  to on my way !  pic.twitter.com/ptpnrjgqqn'},
    {'tweet_user_verified': 0, 'tweet_user_id': 3003790936, 'tweet_favorite_count': 8, 'tweet_sentiment': 1, 'tweet_retweet_count': 7, 'tweet_id': 698154004451323904, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 37, 40), 'tweet_lang': 'en', 'tweet_text': 'mathew brady s business dropped off significantly once old abe got an iphone . # lincolnsbirthday pic.twitter.com/b4i8pzpw7z'},
    {'tweet_user_verified': 0, 'tweet_user_id': 356837905, 'tweet_favorite_count': 8, 'tweet_sentiment': 5, 'tweet_retweet_count': 8, 'tweet_id': 698153555086221312, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 35, 53), 'tweet_lang': 'en', 'tweet_text': 'millennials supporting a # socialist is priceless . ca nt wait for the government to dictate who can have that iphone or internet access .'},
    {'tweet_user_verified': 0, 'tweet_user_id': 2872097713, 'tweet_favorite_count': 20, 'tweet_sentiment': 5, 'tweet_retweet_count': 21, 'tweet_id': 698153222964408321, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 34, 34), 'tweet_lang': 'en', 'tweet_text': 'snapchat no android / iphone pic.twitter.com/bel90svufz'},
    {'tweet_user_verified': 0, 'tweet_user_id': 35453314, 'tweet_favorite_count': 8, 'tweet_sentiment': 1, 'tweet_retweet_count': 1, 'tweet_id': 698152530031853568, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 31, 48), 'tweet_lang': 'en', 'tweet_text': 'has anyone seen a place close to the venue that sells iphone chargers ? hurry please , 1 % left i m running out of pow'},
    {'tweet_user_verified': 0, 'tweet_user_id': 231879129, 'tweet_favorite_count': 16, 'tweet_sentiment': 1, 'tweet_retweet_count': 7, 'tweet_id': 698152206965592064, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 30, 31), 'tweet_lang': 'en', 'tweet_text': 'when u get a new iphone it s lit the best thing ever bc ur battery lasts u agesssss'},
    {'tweet_user_verified': 0, 'tweet_user_id': 407435062, 'tweet_favorite_count': 6, 'tweet_sentiment': 5, 'tweet_retweet_count': 5, 'tweet_id': 698151086222372865, 'tweet_date': datetime.datetime(2016, 2, 12, 15, 26, 4), 'tweet_lang': 'en', 'tweet_text': 'and iphone mad corny for mandatory capitalizing kardashian '},
]

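import math
import pandas as pd

dates = [tw['tweet_date'] for tw in tweets]
sntms = [tw['tweet_sentiment'] for tw in tweets]
# Weight per tweet: 1 + log of its engagement (retweets + favorites + verified flag), truncated to an int.
score = [int(1+math.log(1+tw['tweet_retweet_count']+tw['tweet_favorite_count']+tw['tweet_user_verified'])) for tw in tweets]

ts = pd.Series(sntms, index=dates)
# Unweighted daily mean (older pandas versions spelled this ts.resample('D', how='mean')).
cv = ts.resample('D').mean()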
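
To see what the `score` expression above produces, here is a minimal sketch that applies the same formula to a few individual tweets. The helper name `tweet_score` is hypothetical and only used for illustration; the last call uses the counts from the first tweet in the list.

```python
import math

def tweet_score(retweets, favorites, verified):
    # Same expression as the score line in the question, for one tweet
    # (hypothetical helper, not part of the original code).
    return int(1 + math.log(1 + retweets + favorites + verified))

no_engagement = tweet_score(0, 0, 0)        # log(1) = 0, so the score is 1
one_favorite = tweet_score(0, 1, 0)
very_popular = tweet_score(4842, 17048, 1)  # counts from the first tweet

print(no_engagement, one_favorite, very_popular)
```

Because of the logarithm, even a tweet with tens of thousands of interactions only weighs about ten times as much as a tweet nobody reacted to.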

Following http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.resample.html, you can pass a custom function to resample:

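 import numpy as np
 import pandas as pd
 import datetime as dt

 def weighted(score, data):
     # Weighted mean: sum of score*data divided by the total weight.
     # (Dividing by len(score), as first posted, is only right when every weight is 1.)
     return np.sum(score * data) / np.sum(score)


 data = np.random.random(10)
 index = pd.date_range(start=dt.date.today(), periods=10, freq='30min')
 df = pd.DataFrame(data, index=index, columns=['col'])
 # Older pandas versions spelled this df.resample('1h', how=...).
 print(df.resample('1h').apply(lambda a: weighted(np.ones(len(a)), a)))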

You can verify that this gives the ordinary mean when all the weights are ones.
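
A small sketch of that check, on made-up values, is below. It also tries unequal weights, because that is where the choice of normaliser starts to matter: dividing by the number of items instead of the weight total can give a result outside the range of the data.

```python
import numpy as np

values = np.array([0.2, 0.4, 0.9])   # illustrative data
ones = np.ones(len(values))

# With unit weights the weighted mean collapses to the plain mean.
unit_weighted = np.average(values, weights=ones)
plain_mean = values.mean()

# With unequal weights, compare the two possible normalisers.
weights = np.array([1, 1, 4])
by_weight_total = (weights * values).sum() / weights.sum()   # proper weighted mean
by_count = (weights * values).sum() / len(weights)           # not a mean any more

print(unit_weighted, plain_mean, by_weight_total, by_count)
```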

Alternatively, you can keep the weights in a second column and pass each resampled group, with both columns, into the function:

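 def weighted2(row):
     # row is the sub-DataFrame for one resampling bin.
     a = row['a'].values   # data
     b = row['b'].values   # weights
     return np.sum(a * b) / np.sum(b)

 score = np.ones(10)
 data = np.random.random(10)
 index = pd.date_range(start=dt.date.today(), periods=10, freq='30min')
 df = pd.DataFrame(data, index=index, columns=['a'])
 df['b'] = score
 # Older pandas versions spelled these df.resample('1h', how=weighted2)['a'] and df.resample('1h').
 print(df.resample('1h').apply(weighted2))
 print(df.resample('1h').mean()['a'])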

Both give the same values (the output below is from one run on random data):

                             a
 2016-02-12 00:00:00  0.633469
 2016-02-12 01:00:00  0.436514
 2016-02-12 02:00:00  0.341746
 2016-02-12 03:00:00  0.745674
 2016-02-12 04:00:00  0.068618
                             a
 2016-02-12 00:00:00  0.633469
 2016-02-12 01:00:00  0.436514
 2016-02-12 02:00:00  0.341746
 2016-02-12 03:00:00  0.745674
 2016-02-12 04:00:00  0.068618

# Add scores to the sentiment (random scores from 1 to 10, for illustration).
df = pd.concat([ts, pd.Series(np.random.randint(1, 11, len(ts)), index=ts.index)],
               axis=1, keys=['sentiment', 'score'])

# Weighted daily score.
>>> df.resample('D').apply(lambda x: (x.score * x.sentiment).sum() /
...                                  x.score.sum()).rename('sentiment')
2016-02-12    2.247312
Freq: D, Name: sentiment, dtype: float64
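
The same daily weighted mean can also be computed without a per-group function, by resampling the numerator and the denominator separately and dividing. This is only a sketch on a small made-up frame; it assumes columns named `sentiment` and `score` as in the snippet above.

```python
import pandas as pd

# Illustrative data: two tweets on each of two days.
index = pd.to_datetime([
    "2016-02-12 15:00", "2016-02-12 17:00",
    "2016-02-13 09:00", "2016-02-13 10:00",
])
df = pd.DataFrame({"sentiment": [1, 5, 1, 5], "score": [2, 1, 1, 3]}, index=index)

# Daily sum of score * sentiment, and daily sum of the scores.
weighted_sum = (df["sentiment"] * df["score"]).resample("D").sum()
total_weight = df["score"].resample("D").sum()

# Weighted mean per day.
daily = weighted_sum / total_weight
print(daily)
```

Here the first day works out to (2*1 + 1*5) / 3 and the second to (1*1 + 3*5) / 4. A day with no tweets at all would come out as NaN, since both sums are zero.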
