
Grouping Tweets by Half-Hour, Hour, and Day in a Pandas DataFrame

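I'm working on a sentiment analysis project using Twitter data, and I've run into a small problem with dates. The code itself runs fine, but I don't know how to build custom time blocks for grouping my final data. Right now the default is to group them by the second, which isn't very useful. I want to be able to group them into half-hour, hour, and day segments...

Feel free to skip to the bottom of the code to see where the problem is!

Here is the code: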

import tweepy
API_KEY = "XXXXX"
API_SECRET = "XXXXXX"
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
import sklearn as sk
import pandas as pd
import got3
  #"Get Old Tweets" to find older data

tweetCriteria = got3.manager.TweetCriteria() 
tweetCriteria.setQuerySearch("Kentucky Derby")
tweetCriteria.setSince("2016-05-07") 
tweetCriteria.setUntil("2016-05-08")
tweetCriteria.setMaxTweets(1000)

TweetCriteria = got3.manager.TweetCriteria()
KYDerby_tweets = got3.manager.TweetManager.getTweets(tweetCriteria)

from afinn import Afinn
afinn = Afinn()
    #getting afinn library to use for sentiment polarity analysis

for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    Id = x.id
    print(Text)

AllText = []
AllRetweets = []
AllFavorites = []
AllDates = []
AllIDs = []
for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    Id = x.id
    AllText.append(Text)
    AllRetweets.append(Retweets)
    AllFavorites.append(Favorites)
    AllDates.append(Date)
    AllIDs.append(Id)

data_set = [[x.id, x.date, x.text, x.retweets, x.favorites] 
        for x in KYDerby_tweets]
df = pd.DataFrame(data=data_set, columns=["Id", "Date", "Text", "Retweets", "Favorites"])
    #I now have a DataFrame with my basic info in it

pscore = []
for x in KYDerby_tweets:
    pscore.append(afinn.score(x.text))
df['P Score'] = pscore
    #I now have the pscores for each Tweet in the DataFrame

nrc = pd.read_csv('C:\\users\\andrew.smith\\downloads\\NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt', sep="\t", names=["word", "emotion", "association"], skiprows=45)
    #import NRC emotion lexicon

nrc = nrc[nrc["association"]==1]
nrc = nrc[nrc["emotion"].isin(["positive", "negative"]) == False]
    #cleaned it up a bit

from nltk import TweetTokenizer
tt = TweetTokenizer()
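tokenized = tt.tokenize(df.iloc[14,:]['Text'])  # assumption: one sample tweet, the same row used for tweettext further down
tokenized = [x.lower() for x in tokenized]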
    #built my Tweet-specific, NRC-ready tokenizer

emotions = list(set(nrc["emotion"]))
index2emotion = {}
emotion2index = {}

for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i  
cv = [0] * len(emotions)
    #built indices showing locations of emotions

for token in tokenized:
    sub = nrc[nrc['word'] == token]
    token_emotions = sub['emotion']
    for e in token_emotions:
        position_index = emotion2index[e]
        cv[position_index] += 1

emotions = list(set(nrc['emotion']))
index2emotion = {}
emotion2index = {}
for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i

def makeEmoVector(tweettext):
    cv = [0] * len(emotions)
    tokenized = tt.tokenize(tweettext)
    tokenized = [x.lower() for x in tokenized]
    for token in tokenized:
        sub = nrc[nrc['word'] == token]
        token_emotions = sub['emotion']
        for e in token_emotions:
            position_index = emotion2index[e]
            cv[position_index] += 1
    return cv

tweettext = df.iloc[14,:]['Text']

emotion_vectors = []

for text in df['Text']:
    emotion_vector = makeEmoVector(text)
    emotion_vectors.append(emotion_vector)

ev = pd.DataFrame(emotion_vectors, index=df.index, columns=emotions)
    #Now I have a DataFrame with all of the emotion counts for each tweet

Date_Group = df.join(ev).groupby("Date")  # the emotion counts live in ev, so join them onto df before grouping
Date_Group[emotions].agg("sum")
    #Finally, we arrive at the problem!  When I run this, I end up with tweets that are grouped by the second.  What I want is to be able to group them: a) by the half-hour, b) by the hour, and c) by the day

The default date format of tweets obtained through the Tweepy API is "2017-04-14 18:41:56". To group tweets by hour, you can do something simple like this:

# This will get the time part (assumes df['date'] holds strings such as "2017-04-14 18:41:56")
time = [item.split(" ")[1] for item in df['date'].values] 

# This will get the hour parameter
hour = [item.split(":")[0] for item in time]

df['time'] = hour
grouped_tweets = df[['time', 'number_tweets']].groupby('time')
tweet_growth_hour = grouped_tweets.sum()
tweet_growth_hour['time']= tweet_growth_hour.index
print(tweet_growth_hour)
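
If the date column already holds real timestamps rather than strings (which may well be the case for the x.date values in the question), the string splitting can be replaced by the pandas .dt accessor. The following is only a sketch under that assumption: the DataFrame is made-up illustrative data, and the column names "Date" and "number_tweets" are borrowed from the snippets above. It also shows the difference between grouping by hour of day, which is what the string approach does, and grouping by calendar hour.

```python
import pandas as pd

# Illustrative data only; column names follow the snippets above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-05-07 18:05:10",
        "2016-05-07 18:41:56",
        "2016-05-07 19:10:00",
        "2016-05-08 18:20:30",
    ]),
    "number_tweets": [1, 1, 1, 1],
})

# Hour of day (0-23): rows from different days that share an hour are merged,
# which mirrors the behaviour of the string-splitting approach.
by_hour_of_day = df.groupby(df["Date"].dt.hour)["number_tweets"].sum()

# Calendar hour: floor each timestamp to the start of its hour,
# so 18:00 on May 7 and 18:00 on May 8 remain separate groups.
by_calendar_hour = df.groupby(df["Date"].dt.floor("h"))["number_tweets"].sum()

print(by_hour_of_day)
print(by_calendar_hour)
```

Which of the two is appropriate depends on whether the data spans more than one day; for the single-day query in the question they would likely give the same groups.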

To group by day, you can do something similar:

days = [item.split(" ")[0] for item in df['date'].values]
df['days'] = days
grouped_tweets = df[['days', 'number_tweets']].groupby('days')
tweet_growth_days = grouped_tweets.sum()
tweet_growth_days['days']= tweet_growth_days.index
print(tweet_growth_days)
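
The string-splitting approach does not cover the half-hour case asked about in the question. One possible way to get all three bin sizes is pd.Grouper with a frequency string, sketched here under the assumption that the "Date" column is (or has been converted with pd.to_datetime to) a datetime column and that the emotion counts sit in the same DataFrame. The data and the two emotion column names below are invented for illustration, and sum_by is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Made-up stand-in for the question's df combined with its emotion counts.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-05-07 18:05:10",
        "2016-05-07 18:41:56",
        "2016-05-07 18:55:00",
        "2016-05-07 19:10:00",
        "2016-05-08 09:00:00",
    ]),
    "joy": [1, 0, 2, 1, 0],
    "anger": [0, 1, 0, 0, 3],
})
emotions = ["joy", "anger"]


def sum_by(frame, freq):
    """Sum the emotion columns within fixed-width time bins of width `freq`."""
    return frame.groupby(pd.Grouper(key="Date", freq=freq))[emotions].sum()


half_hourly = sum_by(df, "30min")  # a) by the half-hour
hourly = sum_by(df, "h")           # b) by the hour
daily = sum_by(df, "D")            # c) by the day

print(half_hourly.head())
print(hourly.head())
print(daily)
```

Note that pd.Grouper creates every bin between the first and last timestamp, so bins without tweets show up with sums of 0; they can be filtered out afterwards if that is not wanted.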
