Matplotlib - Time Series Analysis Python

I'm trying to create two types of time series from this data ( https://gist.github.com/datomnurdin/33961755b306bc67e4121052ae87cfbc ): first, the number of entries per day; second, the total sentiment per day.

Code for the second one, total sentiment per day:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates=['date'], index_col='date')

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

plot_df(df, x=df.index, y=df.sentiment, title='Sentiment Over Time')

The second time-series graph doesn't make any sense to me. Also, is it possible to save the figure for future reference?

[image: the sentiment-over-time line plot produced by the code above]

Try checking the source data.


date

If I try to plot a distribution of date with the following code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

df['date'].hist()
plt.show()

I get:

[image: histogram of the date column]

As you can see, most of the date values are concentrated around 2020-05-19 or 2020-05-30, with nothing in between. So it makes sense to see points only on the left and on the right side of your graph, not in the middle.
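
The same check can be done numerically instead of reading it off a histogram. This is a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from data_filtered.csv); it only assumes the column is named date, as in the question.

```python
import pandas as pd

# Hypothetical stand-in for data_filtered.csv: a few rows on two days.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-19", "2020-05-19", "2020-05-30"]),
    "sentiment": [1, -1, 0],
})

# Number of rows per distinct date, ordered chronologically.
rows_per_date = df["date"].value_counts().sort_index()
print(rows_per_date)
```

On the real file, a short result here would confirm that the rows fall on only a handful of distinct dates.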


sentiment

If I try to plot a distribution of sentiment with the following code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

df['sentiment'].hist()
plt.show()

I get:

[image: histogram of the sentiment column]

As you can see, the sentiment values fall into three groups: -1, 0 and 1, with no other value. So it makes sense to see points only at the bottom, at the center and at the top of your graph, and nowhere else.


scatterplot

Finally, I try to combine date and sentiment in a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

fig, ax = plt.subplots(1, 1, figsize = (16, 5))

ax.plot(df['date'], df['sentiment'], 'o', markersize = 15)
ax.set_title('Sentiment Over Time')
ax.set_xlabel('Date')
ax.set_ylabel('Value')

plt.show()

And I get:

[image: scatter plot of sentiment against date]

It is exactly your graph, but the points are not connected by a line. You can see how the values are concentrated in specific regions and are not scattered.


cumulative

If you want to aggregate the sentiment values by date, check this code:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('data_filtered.csv', parse_dates = ['date'])

df_cumulate = df.groupby(['date']).sum()

def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.savefig('graph.png')
    plt.show()

plot_df(df_cumulate, x=df_cumulate.index, y=df_cumulate.sentiment, title='Sentiment Over Time')

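I aggregate the data through the line df_cumulate = df.groupby(['date']).sum(), which sums the sentiment values for each date, and plt.savefig('graph.png') saves the figure to a file. Here is the plot of the summed sentiment over time:

[image: line plot of the summed sentiment per date]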
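
The question also asks for the number of entries per day, which the code above does not compute. A minimal sketch of getting both series in one pass, on made-up rows (the timestamps and values are hypothetical); the column names n_rows and total_sentiment are mine, and dt.normalize() is only needed if the date column carries a time of day.

```python
import pandas as pd

# Hypothetical sample; the real file is data_filtered.csv from the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-05-19 08:00", "2020-05-19 13:30", "2020-05-19 21:15",
        "2020-05-30 09:00", "2020-05-30 10:45",
    ]),
    "sentiment": [1, 1, -1, 0, -1],
})

# Drop the time of day so that all rows of one calendar day share a key.
day = df["date"].dt.normalize()

# One row per day: how many entries there were and what they sum to.
per_day = df.groupby(day)["sentiment"].agg(n_rows="count", total_sentiment="sum")
print(per_day)
```

Either column can then be passed to plot_df as y, with per_day.index as x.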

The data that you linked to has eight separate dates.

If you simply copy/paste, the dates are interpreted as strings rather than as timepoints.

You can change this by converting them into datetime objects:

#convert to datetime
df['date'] = pd.to_datetime(df['date'])

The connections across the plot come from the fact that the index of a datapoint determines when it is plotted, but the value of its x-coordinate (here: date) determines where it is plotted. Since plt.plot connects datapoints, datapoints that are plotted one after another are joined by a line, irrespective of where they end up. You can align timepoint and position by sorting the data:

#then sort by date
df.sort_values(by='date', inplace=True)

This doesn't make for an easily interpretable plot, but now at least you know what lines come from where:

[image: line plot of sentiment after sorting by date]
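
Whether the rows are in plotting order can be checked without drawing anything. The sketch below uses four made-up rows (hypothetical values) and the same two steps as above, conversion and sorting; df_sorted is a name I introduce here.

```python
import pandas as pd

# Hypothetical rows in the kind of mixed order that causes criss-crossing lines.
df = pd.DataFrame({
    "date": ["2020-05-30", "2020-05-19", "2020-05-30", "2020-05-19"],
    "sentiment": [1, 0, -1, 1],
})
df["date"] = pd.to_datetime(df["date"])

# plt.plot joins points in row order, so a non-monotonic x column
# makes the line jump back and forth across the figure.
print(df["date"].is_monotonic_increasing)  # False

# A stable sort keeps rows with the same date in their original order.
df_sorted = df.sort_values(by="date", kind="stable").reset_index(drop=True)
print(df_sorted["date"].is_monotonic_increasing)  # True
```

After sorting, the remaining vertical segments come only from several sentiment values sharing one date.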

A nicer way of plotting the data would be a stacked bar chart:

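#count rows per (date, sentiment) pair, then turn sentiment into columns
a = df.groupby(['date', 'sentiment']).size().unstack(fill_value=0)
a.columns = ['-1', '0', '1']  #assumes all three sentiment values occur
a.plot(kind='bar', stacked=True)

[image: stacked bar chart of sentiment counts per date]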
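
The table behind such a chart can also be built with pd.crosstab, which does not depend on how many other columns the file has and fills missing date/sentiment combinations with 0. This is a sketch on made-up rows (hypothetical values), not the output for the linked data.

```python
import pandas as pd

# Hypothetical sample with the same two columns as the question's data.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-19", "2020-05-19", "2020-05-19", "2020-05-30", "2020-05-30"]
    ),
    "sentiment": [1, 1, -1, 0, 0],
})

# Rows are dates, columns are sentiment values, cells are row counts.
counts = pd.crosstab(df["date"], df["sentiment"])
print(counts)
```

Calling counts.plot(kind='bar', stacked=True) on this table should give the same kind of stacked bar chart, with one bar per date.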

