Matplotlib - Time Series Analysis Python
I'm trying to create two types of time series using this data ( https://gist.github.com/datomnurdin/33961755b306bc67e4121052ae87cfbc ). First, the number of entries per day. Second, the total sentiment per day.
Code for the second one (total sentiment per day):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates=['date'], index_col='date')

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.index, y=df.sentiment, title='Sentiment Over Time')
The second time-series graph doesn't make any sense to me. Is it also possible to save the figure for future reference?
Try checking the source data.
date
If I try to plot the distribution of date with the following code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])
df['date'].hist()
plt.show()
I get:
As you can see, most of the date values are concentrated around 2020-05-19 or 2020-05-30, with nothing in between. So it makes sense to see points only on the left and on the right side of your graph, not in the middle.
sentiment
If I try to plot the distribution of sentiment with the following code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])
df['sentiment'].hist()
plt.show()
I get:
As you can see, the sentiment values are concentrated in three groups: -1, 0 and 1; there are no other values. So it makes sense to see points only at the bottom, at the center and at the top of your graph, not anywhere else.
scatterplot
Finally, I combine date and sentiment in a scatter plot:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])
fig, ax = plt.subplots(1, 1, figsize = (16, 5))
ax.plot(df['date'], df['sentiment'], 'o', markersize = 15)
ax.set_title('Sentiment Over Time')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
plt.show()
And I get:
It is exactly your graph, but the points are not connected by a line. You can see how the values are concentrated in specific regions rather than scattered.
cumulative
If you want to aggregate the sentiment value by date, check this code:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])
df_cumulate = df.groupby(['date']).sum()

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.savefig('graph.png')
    plt.show()

plot_df(df_cumulate, x=df_cumulate.index, y=df_cumulate.sentiment, title='Sentiment Over Time')
I aggregate the data through the line df_cumulate = df.groupby(['date']).sum(), and the plt.savefig('graph.png') call, placed before plt.show(), saves the figure to disk. Here is the plot of sentiment summed by date over time:
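The question also asks for the number of entries per day, which the code above does not compute. A minimal sketch of one way to get both series at once is shown here. The rows are made up for illustration and the column names date and sentiment are assumed to match data_filtered.csv; grouping on the normalized date is a precaution in case the real timestamps carry a time of day.

```python
import pandas as pd

# Made-up rows standing in for data_filtered.csv
# (assumed columns: date, sentiment).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-05-19 08:00", "2020-05-19 17:30", "2020-05-19 21:10",
        "2020-05-30 09:15", "2020-05-30 12:45",
    ]),
    "sentiment": [1, -1, 1, 0, -1],
})

# Drop the time of day so that all rows from one calendar day share a key.
day = df["date"].dt.normalize().rename("day")

# One row per day: how many entries there were and what they sum to.
per_day = df.groupby(day)["sentiment"].agg(count="count", total="sum")

print(per_day)
```

Either column can then be passed to plot_df, for example with x=per_day.index and y=per_day["count"].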
The data that you linked to has eight separate dates.
If you simply copy/paste, the dates are not interpreted as timepoints, but rather as strings.
You can change this by converting them into datetime objects:
#convert to datetime
df['date'] = pd.to_datetime(df['date'])
The connections across the plot come from the fact that the index of a datapoint determines when it is plotted, but the value of its x-coordinate (here: date) determines where it is plotted. Since plt.plot is a method that connects datapoints, datapoints that are plotted one after another will be connected with a line, irrespective of where they end up. You can align timepoint and position by sorting the data:
#then sort by date
df.sort_values(by='date', inplace=True)
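The two steps can be tried on a small table to see what they change. This is only a sketch: the rows are invented and deliberately out of order, and the column names date and sentiment are assumed to match the linked data.

```python
import pandas as pd

# Made-up, deliberately unordered rows (assumed columns: date, sentiment).
df = pd.DataFrame({
    "date": ["2020-05-30", "2020-05-19", "2020-05-30", "2020-05-19"],
    "sentiment": [1, -1, 0, 1],
})

# At this point the dates are plain strings, not timepoints.
print(df["date"].dtype)

# Convert to datetime, then sort by date. A stable sort keeps rows that
# share a date in their original relative order.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date", kind="stable").reset_index(drop=True)

# After sorting, consecutive rows never go back in time, so a line drawn
# through them in row order only moves from left to right.
print(df["date"].is_monotonic_increasing)
print(df)
```

Rows that share a date are still joined by vertical segments when passed to plt.plot, which is why the sorted line plot remains hard to read.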
This doesn't make for an easily interpretable plot, but now at least you know which lines come from where:
A nicer way of plotting the data would be a stacked bar chart:
a=df.groupby(['date', 'sentiment']).agg(len).unstack()
a.columns = ['-1', '0', '1']
a[['-1', '0', '1']].plot(kind='bar', stacked=True)
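The snippet above relies on the shape of the DataFrame: the result of agg(len) depends on which other columns are present, and the column renaming assumes that all three sentiment values occur. A sketch of a variant that avoids both assumptions is given here, using size() to count rows per (date, sentiment) pair. The sample rows are made up and the column names are assumed.

```python
import pandas as pd

# Made-up rows (assumed columns: date, sentiment).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-05-19", "2020-05-19", "2020-05-19",
        "2020-05-30", "2020-05-30",
    ]),
    "sentiment": [1, 1, -1, 0, 0],
})

# Count rows per (date, sentiment) pair, then pivot sentiment into columns.
# reindex guarantees the -1, 0 and 1 columns exist even if one value never occurs.
counts = (
    df.groupby(["date", "sentiment"])
      .size()
      .unstack(fill_value=0)
      .reindex(columns=[-1, 0, 1], fill_value=0)
)

print(counts)
```

Calling counts.plot(kind='bar', stacked=True) on this table should draw the same kind of stacked bar chart, with one bar per date.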