Matplotlib - Time Series Analysis in Python

Question

I'm trying to create two types of time series using this data ( https://gist.github.com/datomnurdin/33961755b306bc67e4121052ae87cfbc ). First, the number of rows per day. Second, the total sentiment per day.

Code for the second one, total sentiment per day:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates=['date'], index_col='date')

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.index, y=df.sentiment, title='Sentiment Over Time')

The second time-series graph doesn't make any sense to me. Also, is it possible to save the figure for future reference?

Answer 1

Try checking the source data.

date

If I plot the distribution of date with the following code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

df['date'].hist()
plt.show()

I get:

As you can see, most of the date values are concentrated around 2020-05-19 or 2020-05-30, with nothing in between. So it makes sense to see points only on the left and on the right side of your graph, not in the middle.
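The first series you asked for, the number of rows per day, can be read off the same column. Here is a minimal sketch; the small DataFrame is a made-up stand-in for data_filtered.csv, so only the column names date and sentiment are taken from your data.

```python
import pandas as pd

# Hypothetical sample standing in for data_filtered.csv (values are made up).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-19", "2020-05-19", "2020-05-19", "2020-05-30", "2020-05-30"]
    ),
    "sentiment": [1, -1, 0, 1, 1],
})

# Number of rows recorded on each calendar day, in chronological order.
counts_per_day = df.groupby(df["date"].dt.date).size()
print(counts_per_day)
```

Calling counts_per_day.plot(kind='bar') should then give one bar per day, which is usually easier to read than a histogram when there are only a few distinct dates.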

sentiment

If I plot the distribution of sentiment with the following code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

df['sentiment'].hist()
plt.show()

I get:

As you can see, the sentiment values fall into three groups: -1, 0 and 1; there are no other values. So it makes sense to see points only at the bottom, at the center and at the top of your graph, not anywhere else.

scatterplot

Finally, I combine date and sentiment in a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

fig, ax = plt.subplots(1, 1, figsize = (16, 5))

ax.plot(df['date'], df['sentiment'], 'o', markersize = 15)
ax.set_title('Sentiment Over Time')
ax.set_xlabel('Date')
ax.set_ylabel('Value')

plt.show()

And I get:

This is exactly your graph, but without the line connecting the points. You can see that the values are concentrated in specific regions rather than scattered.

cumulative

If you want to aggregate the sentiment value by date, check this code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

df_cumulate = df.groupby(['date']).sum()

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.savefig('graph.png')
    plt.show()

plot_df(df_cumulate, x=df_cumulate.index, y=df_cumulate.sentiment, title='Sentiment Over Time')

I aggregate the data through the line df_cumulate = df.groupby(['date']).sum(), which adds up the sentiment values for each date, and plt.savefig('graph.png') saves the figure to a file before it is shown. The resulting plot shows the total sentiment for each date over time.
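Note that groupby(...).sum() gives one total per date, not a running total. If you want a total that accumulates across dates, cumsum() on the daily totals is one way to get it. The sketch below uses made-up sample data in place of data_filtered.csv, and the output file name is only an example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for data_filtered.csv (values are made up).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-19", "2020-05-19", "2020-05-20", "2020-05-30", "2020-05-30"]
    ),
    "sentiment": [1, -1, 1, 1, 1],
})

daily_total = df.groupby("date")["sentiment"].sum()  # one value per date
running_total = daily_total.cumsum()                 # accumulates across dates

fig, ax = plt.subplots(figsize=(16, 5))
ax.plot(daily_total.index, daily_total.values, marker="o", label="daily total")
ax.plot(running_total.index, running_total.values, marker="o", label="running total")
ax.set(title="Sentiment Over Time", xlabel="Date", ylabel="Value")
ax.legend()

# Example file name (an assumption); save before any call to plt.show().
fig.savefig("sentiment_totals.png")
```

Saving through the figure object before showing it avoids ending up with an empty image, which can happen when savefig is called after the window has been closed.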

Answer 2

The data that you linked to has eight separate dates.

If you simply copy/paste, the dates are not interpreted as timepoints, but rather as strings.

You can change this by converting them into datetime objects:

#convert to datetime
df['date'] = pd.to_datetime(df['date'])

The connections across the plot come from the fact that the index of a datapoint determines when it is plotted, but the value of its x-coordinate (here: date) determines where it is plotted. Since plt.plot connects consecutive datapoints, datapoints that are plotted one after another are joined by a line, irrespective of where they end up. You can align timepoint and position by sorting the data:

#then sort by date
df.sort_values(by='date', inplace=True)

This doesn't make for an easily interpretable plot, but now at least you know which lines come from where.
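To check whether row order is what causes the crisscrossing lines, you can test if the dates are already in chronological order before and after sorting. This is a small sketch with made-up rows; the stable mergesort is my choice here so that rows sharing a date keep their file order.

```python
import pandas as pd

# Hypothetical rows in file order, deliberately out of chronological order.
df = pd.DataFrame({
    "date": ["2020-05-30", "2020-05-19", "2020-05-30", "2020-05-19"],
    "sentiment": [1, -1, 0, 1],
})

df["date"] = pd.to_datetime(df["date"])      # strings -> datetime64
print(df["date"].is_monotonic_increasing)    # False: a line plot would jump back and forth

# mergesort is stable, so rows that share a date keep their original order.
df_sorted = df.sort_values(by="date", kind="mergesort")
print(df_sorted["date"].is_monotonic_increasing)  # True
```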

A nicer way of plotting the data would be a stacked bar chart:

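# count the rows for each (date, sentiment) pair, one column per sentiment value
a = df.groupby(['date', 'sentiment']).size().unstack(fill_value=0)
# rename the columns; this assumes all three sentiment values occur in the data
a.columns = ['-1', '0', '1']
a[['-1', '0', '1']].plot(kind='bar', stacked=True)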
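The same table of counts can also be built with pd.crosstab, which avoids renaming the columns by hand. This is only a sketch on made-up sample data; the file name and axis labels are assumptions, not part of the original data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for data_filtered.csv (values are made up).
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-19", "2020-05-19", "2020-05-19", "2020-05-30", "2020-05-30"]
    ),
    "sentiment": [1, -1, 1, 0, 1],
})

# Rows are days, columns are sentiment values, cells are row counts.
counts = pd.crosstab(df["date"].dt.date, df["sentiment"])

ax = counts.plot(kind="bar", stacked=True, figsize=(16, 5))
ax.set(title="Sentiment counts per day", xlabel="Date", ylabel="Count")
ax.figure.savefig("sentiment_counts.png")  # example file name (an assumption)
```

The height of each stacked bar is the number of rows for that day, so this one chart covers the count-per-day series and the sentiment breakdown at the same time.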

Answer 1: score 2, accepted, 2020-06-04 16:36:35

Answer 2: score 1, 2020-06-04 16:31:46
