Unable to plot histogram with time on x-axis using Matplotlib and Python

I am trying to plot the number of times a user tweeted at specific times of the day. I plan to plot these on a histogram/bar chart with 24 "bins", one for each hour.

I have the data in a Pandas dataframe in two columns: the tweet and the time of the tweet (as a datetime object).

I have converted the Time column into a Pandas time, however I am having a hard time plotting it correctly. If I set the value of bins to 24, I get the following chart (here), which doesn't look correct. Firstly the chart looks wrong, and secondly the x-axis has horrible formatting.

I would like to try to resolve these two issues. Firstly the data isn't being plotted correctly, and secondly the horizontal axis formatting is incorrect.

I have plotted the data using Google Sheets and the correct chart should look like this. I don't mind if the values are % of total or absolute volume.

Code to generate plots can be found here: generate_data.py and plot_data.py.

Any help is hugely appreciated.

plot_data.py

import datetime
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
import random

import generate_data


screen_name = "@joebiden"
data = generate_data.get_data(screen_name, save_output=True)

df = pd.DataFrame(data)
df["Time"]= pd.to_datetime(data["Time"], format="%H:%M") 

fig, ax = plt.subplots(1,1)
bin_list = [datetime.time(x) for x in range(24)]

ax.hist(df["Time"], bins=24, color='lightblue')
plt.show()

generate_data.py

import json
import re
from datetime import datetime

import tweepy

import common_words
import twitter_auth 



def create_connection():
    auth = tweepy.OAuthHandler(twitter_auth.CONSUMER_KEY, twitter_auth.CONSUMER_SECRET)
    auth.set_access_token(twitter_auth.ACCESS_KEY, twitter_auth.ACCESS_SECRET)
    return tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)


def retrieve_next_set_of_tweets(screen_name, api, count,max_id):
    '''Return next 200 user tweets'''
    return api.user_timeline(screen_name=screen_name,count=count, tweet_mode='extended', max_id=max_id)


def get_tweet_times(screen_name, api):
    user_tweet_count = api.get_user(screen_name).statuses_count

    all_tweets = {'Tweet':[], 'Time':[]}
    block_of_tweets = api.user_timeline(screen_name=screen_name,count=200, tweet_mode='extended')
    all_tweets["Tweet"].extend([tweet.full_text for tweet in block_of_tweets])
    all_tweets["Time"].extend([tweet.created_at for tweet in block_of_tweets])
    oldest = block_of_tweets[-1].id - 1

    
    while block_of_tweets:    
        try:
            block_of_tweets = retrieve_next_set_of_tweets(screen_name, api, 200, oldest)
            oldest = block_of_tweets[-1].id - 1
        except IndexError: #Reached limit of 3245
            pass
        # all_tweets.update({tweet.full_text: tweet.created_at.time() for tweet in block_of_tweets}) 
        all_tweets["Tweet"].extend([tweet.full_text for tweet in block_of_tweets])
        all_tweets["Time"].extend([tweet.created_at for tweet in block_of_tweets])
 
    return all_tweets


def get_all_tweets(screen_name, api):
    user_tweet_count = api.get_user(screen_name).statuses_count

    all_tweets = []
    block_of_tweets = api.user_timeline(screen_name=screen_name,count=200, tweet_mode='extended')
    all_tweets.extend([tweet.full_text for tweet in block_of_tweets])
    oldest = block_of_tweets[-1].id - 1
    

    while block_of_tweets:    
        try:
            block_of_tweets = retrieve_next_set_of_tweets(screen_name, api, 200, oldest)
            oldest = block_of_tweets[-1].id - 1
        except IndexError: #Reached limit of 3245
            pass

        all_tweets.extend([tweet.full_text for tweet in block_of_tweets]) 
    return all_tweets


def parse_all_tweets(tweet_list, max_words_to_show=50):
    tweet_dict = {}
    regex = re.compile('[^a-zA-Z ]')
    for tweet in tweet_list: 
        text = regex.sub("", tweet).lower().strip().split()
        
        for word in text:
            if word in common_words.MOST_COMMON_ENGLISH_WORDS: continue
            if word in tweet_dict.keys():
                tweet_dict[word] += 1
            else:
                if len(tweet_dict.items()) == max_words_to_show:
                    return tweet_dict
                tweet_dict[word] = 1
    return tweet_dict
    
    
def get_data(screen_name, words_or_times="t", save_output=False):
    api = create_connection()
    print(f"...Getting max of 3245 tweets for {screen_name}...")

    if words_or_times == "t":
        all_tweets = get_tweet_times(screen_name, api)
        suffix = "tweet_times"
        
    elif words_or_times == "w":
        suffix = "ranked_words"
        parsed_tweets = parse_all_tweets(get_all_tweets(screen_name, api))
        parsed_tweets = {k:v for k,v in sorted(parsed_tweets.items(), key=lambda item: item[1], reverse=True)}

    else:
        return "...Error. Please enter 't' or 'w' to signify 'times' or 'words'."

    if save_output:
        f_name = f"{screen_name}_{suffix}.json"
        with open(f_name, "w") as f:
            json.dump(all_tweets if words_or_times == "t" else parsed_tweets, f, indent=4, default=str)
        
        print(f"...Complete! File saved as '{f_name}'")

    return all_tweets if words_or_times == "t" else parsed_tweets


if __name__ == "__main__":
    get_data(screen_name="@joebiden", save_output=True) 

OK. So you want to keep only the time from your datetime. Try replacing

df["Time"]= pd.to_datetime(data["Time"], format="%H:%M")

with

df['Time'] = pd.to_datetime(df['Time'],format= '%H:%M' ).dt.time
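As a complementary sketch (not part of the suggestion above): if the goal is one bar per hour, counting tweets by `.dt.hour` avoids histogram binning altogether. The sample timestamps and the names `counts` and `share` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the tweet timestamps
df = pd.DataFrame({
    "Tweet": ["a", "b", "c", "d", "e"],
    "Time": pd.to_datetime([
        "2020-10-01 09:15:00",
        "2020-10-01 09:45:00",
        "2020-10-02 13:05:00",
        "2020-10-03 23:59:00",
        "2020-10-04 00:10:00",
    ]),
})

# Count tweets per hour of day; reindex so that all 24 hours appear,
# including hours with no tweets
counts = (
    df["Time"].dt.hour
    .value_counts()
    .reindex(range(24), fill_value=0)
)

# Optional: fraction of all tweets instead of absolute counts
share = counts / counts.sum()

print(counts[counts > 0])
```

Calling `counts.plot.bar()` (or `share.plot.bar()`) should then draw the 24 bars, with the integer hours as tick labels.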

I have tried plotting the data you shared in the comment on the answer by Serge de Gosson de Varennes. The only thing I needed to change in your plot_data.py script was the date format, where I added seconds. The rest worked as expected: the times are processed correctly for the x-axis.
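A minimal sketch of that format change, assuming the Time column holds strings of the form HH:MM:SS (the sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical time strings in HH:MM:SS form
times = pd.Series(["09:15:30", "13:05:10", "23:59:59"])

# The format string has to account for the seconds field
parsed = pd.to_datetime(times, format="%H:%M:%S")

# No date is present in the input, so pandas fills in 1900-01-01
print(parsed.dt.hour.tolist())
```

Because every value lands on the same placeholder date, the resulting datetimes differ only in their time of day, which is what the x-axis needs.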

Here is an example where the histogram is created with pd.Series.hist for convenience. The weights argument is included to produce a graph with percentages. Most of the code is for formatting:

import numpy as np                 # v 1.19.2
import pandas as pd                # v 1.1.3
import matplotlib.pyplot as plt    # v 3.3.2
import matplotlib.dates as mdates

# Import data from html table into pandas dataframe
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTc97VEzlfDP_jEkjC7dTbJzcLBLDQeFwPMg6E36BaiH5qkhnedSz8wsVGUMyW6kt85rD20BcTMbvqp/pubhtml'
table, = pd.read_html(url, header=[1], index_col=1)
df = table.iloc[:, 1:]

# Save time variable as a pandas series of datetime dtype
time_var = pd.to_datetime(df['Time'], format='%H:%M:%S')

# Plot variable with pandas histogram function
ax = time_var.hist(bins=24, figsize=(10,5), grid=False, edgecolor='white', zorder=2,
                   weights=np.ones(time_var.size)/time_var.size)

# Format x and y-axes tick labels
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
ax.yaxis.set_major_formatter('{x:.1%}')

# Additional formatting
alpha = 0.3
ax.grid(axis='y', zorder=1, color='black', alpha=alpha)
for spine in ['top', 'right', 'left']:
    ax.spines[spine].set_visible(False)
ax.spines['bottom'].set_alpha(alpha)
ax.tick_params(axis='x', which='major', color=[0, 0, 0, alpha])
ax.tick_params(axis='y', which='major', length=0)
ax.set_title('Tweets sent per hour of day in UTC', fontsize=14, pad=20)
ax.set_ylabel('Relative frequency (% of total)', size=12, labelpad=10)

plt.show()

[Figure: tweets_per_hour]

Because the counts are spread over 24 hours in this histogram, you may notice that the heights of the bars are slightly different from those in the histogram in the image you shared as a reference, where the counts seem to be grouped into 23 bins instead of 24.
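To make the binning explicit: with an integer `bins=24`, the range between the smallest and largest observed value is split into 24 equal intervals, which need not line up exactly with clock hours. The sketch below, using made-up data expressed as fractional hours and NumPy only, passes explicit hour edges and uses equal weights to get fractions of the total. The names `hours`, `edges` and `fractions` are illustrative and not taken from the code above.

```python
import numpy as np

# Hypothetical times of day, expressed as fractional hours
hours = np.array([0.2, 9.25, 9.75, 13.1, 23.99])

# 25 edges give exactly 24 bins: [0, 1), [1, 2), ..., [23, 24]
edges = np.arange(25)

# A weight of 1/N per observation turns counts into fractions of the total
weights = np.ones(hours.size) / hours.size
fractions, _ = np.histogram(hours, bins=edges, weights=weights)

print(fractions[9])  # share of tweets sent between 09:00 and 10:00
```

The same `edges` array can be passed as `bins=edges` to `ax.hist` when the data is expressed in hours.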



Reference: this answer by ImportanceOfBeingErnest
