How to plot a linear regression with datetimes on the x-axis
My DataFrame object looks like
amount
date
2014-01-06 1
2014-01-07 1
2014-01-08 4
2014-01-09 1
2014-01-14 1
I would like a sort of scatter plot with time along the x-axis and amount on the y-axis, with a line through the data to guide the viewer's eye. If I use the pandas plot df.plot(style="o"), it's not quite right, because the line is not there. I would like something like the examples here.
Note: this has a lot in common with Ian Thompson's answer, but the approach is different enough to have it be a separate answer. I use the DataFrame format provided in the question and avoid changing the index.
Seaborn and other libraries don't deal as well with datetime axes as you might like them to. Here's how I'd work around it:
Convert your dates to ordinal numbers. Seaborn will deal better with these than with dates. This is a handy trick for doing all kinds of mathy things with dates and libraries that don't love dates.
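from datetime import date
import pandas as pd
import seaborn
# Assumes 'date' is a regular column of df. If the dates are the index,
# pd.to_datetime(df.index).map(pd.Timestamp.toordinal) does the same job.
df['date_ordinal'] = pd.to_datetime(df['date']).apply(lambda d: d.toordinal())
ax = seaborn.regplot(
    data=df,
    x='date_ordinal',
    y='amount',
)
# Tighten up the axes for prettiness
ax.set_xlim(df['date_ordinal'].min() - 1, df['date_ordinal'].max() + 1)
ax.set_ylim(0, df['amount'].max() + 1)
# Relabel the ordinal ticks with the dates they stand for
ax.set_xlabel('date')
new_labels = [date.fromordinal(int(item)) for item in ax.get_xticks()]
ax.set_xticklabels(new_labels)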
ta-daa!
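The ordinal trick above rests on the round trip between a date and its ordinal number. A minimal sketch using only the standard library, with the first and last dates from the question as sample values:

```python
from datetime import date

# First and last dates from the question's DataFrame
first = date(2014, 1, 6)
last = date(2014, 1, 14)

# toordinal() counts days, so consecutive dates differ by exactly 1
first_ordinal = first.toordinal()
last_ordinal = last.toordinal()
days_between = last_ordinal - first_ordinal

# fromordinal() reverses the conversion, which is what the tick relabeling uses
round_trip = date.fromordinal(first_ordinal)

print(days_between)
print(round_trip)
```

Because ordinals are plain integers spaced one per day, a regression fitted on them has a slope expressed in "amount per day".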
Since Seaborn has trouble with dates, I'm going to create a work-around. First, I'll make the Date column my index:
import pandas as pd
# Make dataframe
df = pd.DataFrame({'amount' : [1, 1, 4, 1, 1]},
                  index = ['2014-01-06',
                           '2014-01-07',
                           '2014-01-08',
                           '2014-01-09',
                           '2014-01-14'])
Second, convert the index to pd.DatetimeIndex:
# Make index pd.DatetimeIndex
df.index = pd.DatetimeIndex(df.index)
Then build a new daily index (idx) that spans the first to the last date:
# Make new index
idx = pd.date_range(df.index.min(), df.index.max())
Third, reindex with the new index (idx):
# Replace original index with idx
df = df.reindex(index = idx)
This will produce a new dataframe with NaN values for the dates that have no data.
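To see the effect of the reindexing step, here is a small self-contained sketch that rebuilds the question's data and lists the dates that end up as NaN. The names reindexed and missing are introduced here for illustration only.

```python
import pandas as pd

# The question's data, with the dates already parsed into a DatetimeIndex
df = pd.DataFrame(
    {'amount': [1, 1, 4, 1, 1]},
    index=pd.DatetimeIndex(['2014-01-06', '2014-01-07', '2014-01-08',
                            '2014-01-09', '2014-01-14']),
)

# One entry per calendar day between the first and last date
idx = pd.date_range(df.index.min(), df.index.max())
reindexed = df.reindex(index=idx)

# Dates that had no row in the original data now hold NaN
missing = list(reindexed[reindexed['amount'].isna()].index.strftime('%Y-%m-%d'))

print(reindexed)
print(missing)
```

Note that the amount column becomes float after reindexing, because NaN cannot be stored in an integer column.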
Fourth, since Seaborn doesn't play nice with dates and regression lines, I'll create a row count column that we can use as our x-axis:
# Insert row count (1, 2, 3, ... in date order) as the last column
df.insert(df.shape[1],
          'row_count',
          df.index.value_counts().sort_index().cumsum())
Fifth, we should now be able to plot a regression line using 'row_count' as our x variable and 'amount' as our y variable:
import seaborn as sns
# Plot regression using Seaborn (regplot returns a matplotlib Axes, stored here as fig)
fig = sns.regplot(data = df, x = 'row_count', y = 'amount')
Sixth, if you would like the dates to be along the x-axis instead of the row_count, you can set the x-tick labels to the index:
import matplotlib.pyplot as plt
# Change x-ticks to dates
labels = [item.get_text() for item in fig.get_xticklabels()]
# Set labels for 1:10 because labels has 11 elements (index 0 is the left edge, index 10 is
# the right edge) but our data only has 9 rows
labels[1:10] = df.index.date
# Set x-tick labels
fig.set_xticklabels(labels)
# Rotate the labels so you can read them
plt.xticks(rotation = 45)
# Change x-axis title
plt.xlabel('date')
plt.show()
Hope this helps!
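The slice labels[1:10] depends on where matplotlib happens to place the ticks. As a possible alternative, the sketch below maps a tick position back to its date through a lookup table. The helper row_count_to_date and the dictionary date_by_row are hypothetical names, not part of the answer above.

```python
import pandas as pd

# Daily index and row_count column, as produced by the steps above
idx = pd.date_range('2014-01-06', '2014-01-14')
df = pd.DataFrame({'row_count': range(1, len(idx) + 1)}, index=idx)

# Lookup table: row_count value -> date string
date_by_row = {int(n): d.strftime('%Y-%m-%d') for d, n in df['row_count'].items()}

def row_count_to_date(x, pos=None):
    """Return the date for tick position x, or '' if x is not an existing row_count."""
    if float(x).is_integer():
        return date_by_row.get(int(x), '')
    return ''

print(row_count_to_date(1))
print(row_count_to_date(9))
```

The (x, pos) signature is the one matplotlib.ticker.FuncFormatter expects, so wrapping the helper in a FuncFormatter and passing it to fig.xaxis.set_major_formatter should label the ticks wherever they fall. That part is untested here.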
Values with a datetime dtype must be converted to something like ordinal.
This can be done by calculating the model with sklearn.linear_model.LinearRegression and then adding the regression line with matplotlib.pyplot.plot.
sns.lineplot(x=[x1_date, x2_date], y=[y1, y2], label='Linear Model', color='magenta') also works.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, sklearn 0.24.2
import numpy as np
import pandas as pd
import yfinance as yf # for data
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# download the data
data = yf.download('aapl', '2019-01-02', '2021-01-01')
# add an ordinal column because sklearn doesn't work with datetimes
data['ordinal'] = data.index.map(pd.Timestamp.toordinal)
# create the model
model = LinearRegression()
# extract x and y from dataframe data
x = data[['ordinal']]
y = data[['Adj Close']]
# fit the model
model.fit(x, y)
# print the slope and intercept if desired
print('intercept:', model.intercept_[0])
print('slope:', model.coef_[0][0])
# select x1 and x2 and get the corresponding date from the index
x1 = data.ordinal.min()
x1_date = data[data.ordinal.eq(x1)].index[0]
x2 = data.ordinal.max()
x2_date = data[data.ordinal.eq(x2)].index[0]
# calculate y1, given x1
y1 = model.predict(np.array([[x1]]))[0][0]
print('y1:', y1)
# calculate y2, given x2
y2 = model.predict(np.array([[x2]]))[0][0]
print('y2:', y2)
[out]:
intercept: -90078.45713565295
slope: 0.12225139598567565
y1: 28.279040945126326
y2: 117.40030861868581
ax1 = data.plot(y='Adj Close', c='k', figsize=(15, 6), grid=True, legend=False)
ax1.plot([x1_date, x2_date], [y1, y2], label='Linear Model', c='magenta')
ax1.legend()
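For the small DataFrame in the question, the same ordinal idea can be sketched without sklearn or downloaded data. This version swaps in numpy.polyfit for LinearRegression and fits on days since the first date rather than on raw ordinals. Both choices are assumptions made here to keep the example short and the numbers small.

```python
import numpy as np
import pandas as pd

# The question's data with a DatetimeIndex
df = pd.DataFrame(
    {'amount': [1, 1, 4, 1, 1]},
    index=pd.to_datetime(['2014-01-06', '2014-01-07', '2014-01-08',
                          '2014-01-09', '2014-01-14']),
)

# Convert the dates to ordinals, then shift so the first date is day 0
ordinal = df.index.map(pd.Timestamp.toordinal).to_numpy()
days = ordinal - ordinal.min()

# Least-squares straight line: amount = intercept + slope * days
slope, intercept = np.polyfit(days, df['amount'].to_numpy(), deg=1)

# Fitted values at the original dates, ready to plot against df.index
fitted = intercept + slope * days

print('slope per day:', slope)
print('intercept on the first date:', intercept)
```

Since fitted lines up with df.index, the line can be drawn directly on a datetime axis, for example with ax.plot(df.index, fitted) on top of df.plot(style='o').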