Converting date row to column for last N days
I want to build a time series prediction model using features such as week of the year, day of the week, season, etc.
Since the prediction will be highly affected by the most recent values, I want to use the values from the last 5 days as features. However, I am having trouble preparing the data for training.
My current table looks like this:
date id score
0 2014-01-01 A 75
1 2014-01-01 B 1
2 2014-01-01 C 2
4 2014-01-02 A 84
5 2014-01-02 B 1
6 2014-01-02 C 3
8 2014-01-03 A 1
9 2014-01-03 B 1
10 2014-01-03 C 1
So I want each row to look like this:
date id score date_1 date_2 date_3 date_4 date_5
10 2014-01-03 A 1 84 75 0 0 0
9 2014-01-03 B 1 1 1 0 0 0
date_1 is the score for the same id one day before the date in the 'date' column, date_2 is the score two days before, and so on.
That way I can predict the next day using the information from the last 5 days, plus more features that are irrelevant to this question. It is OK to fill NaN values with 0.
You can use groupby('id') and shift. Your df should be sorted by date before you run the following command. Note that sort_values returns a new DataFrame, so assign the result: df = df.sort_values('date')
for i in range(5):
    df['date_'+str(i+1)] = df.groupby('id')['score'].shift(i+1).fillna(0).astype(int)
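The above command adds columns date_1 through date_5 to df, each holding the score from that many rows earlier within the same id (0 where there is no earlier row).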
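For reference, here is a minimal self-contained sketch of this approach applied to the sample data from the question. The DataFrame construction is an assumption about how the table is loaded; only the sort and the groupby/shift loop come from the answer.

```python
import pandas as pd

# Sample data from the question (construction is illustrative)
df = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01'] * 3 +
                           ['2014-01-02'] * 3 +
                           ['2014-01-03'] * 3),
    'id': ['A', 'B', 'C'] * 3,
    'score': [75, 1, 2, 84, 1, 3, 1, 1, 1],
})

# shift() works on row order, so the rows must be in date order first
df = df.sort_values('date').reset_index(drop=True)

for i in range(5):
    # Within each id, take the score from i+1 rows earlier; missing -> 0
    df['date_' + str(i + 1)] = (
        df.groupby('id')['score'].shift(i + 1).fillna(0).astype(int)
    )

# Rows for the most recent date
latest = df[df['date'] == pd.Timestamp('2014-01-03')]
print(latest)
# For id A on 2014-01-03: date_1 = 84, date_2 = 75, date_3..date_5 = 0
```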
Time shifting using Timedelta
The other answer shifts by row position. That works in this instance, but it will break if there are gaps in the dates, or if the rows have not been sorted by date.
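To make the failure mode concrete, here is a small sketch using hypothetical data (not from the question) in which id A has no row for 2014-01-02. A row-based shift still pulls in the previous row, even though it is two days old.

```python
import pandas as pd

# Hypothetical data: id A is missing 2014-01-02
gap = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-03']),
    'id': ['A', 'A'],
    'score': [75, 1],
})

# Row-based shift: "one row earlier", not "one day earlier"
gap['date_1'] = gap.groupby('id')['score'].shift(1).fillna(0).astype(int)
print(gap)
# The 2014-01-03 row gets date_1 = 75, although 75 was recorded two days before
```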
You can handle this by converting the DataFrame to a time series (that is, using the dates as the index), then using the freq parameter of DataFrame.shift() with a pandas.Timedelta object.
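As a rough illustration of what the freq parameter does, the sketch below applies it to a single Series with a gap in its dates (hypothetical data; Series.shift takes the same freq parameter as DataFrame.shift). With freq, the index labels move and the values stay attached to them, so aligning back to the original dates only finds a match where a score really exists one day earlier.

```python
import pandas as pd

# Hypothetical scores for one id, with no entry for 2014-01-02
s = pd.Series([75, 1],
              index=pd.to_datetime(['2014-01-01', '2014-01-03']),
              name='score')

# Move every index label forward by one day; values are unchanged
shifted = s.shift(freq=pd.Timedelta('1d'))
print(shifted.index)  # labels are now 2014-01-02 and 2014-01-04

# Align back to the original dates: neither date has a score one day earlier
date_1 = shifted.reindex(s.index).fillna(0).astype(int)
print(date_1)  # both entries are 0
```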
Example data:
import pandas as pd
df = pd.DataFrame({'date': ['2014-01-01'] * 3 +
                           ['2014-01-02'] * 3 +
                           ['2014-01-03'] * 3,
                   'id': ['A', 'B', 'C'] * 3,
                   'score': [75, 1, 2, 84, 1, 3, 1, 1, 1]})
df.date = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
Because of the IDs, we need a couple of nested loops to keep each id's series separate:
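for i in range(5):
    col = 'date_{}'.format(i+1)
    freq = pd.Timedelta('{}d'.format(i+1))
    for id_ in df.id.unique():
        # Shift this id's scores forward by i+1 days; assignment aligns on the date index
        mask = df.id == id_
        df.loc[mask, col] = df.loc[mask, 'score'].shift(freq=freq)
    # Dates with no score i+1 days earlier are NaN; fill with 0 and convert to int
    df[col] = df[col].fillna(0).astype(int)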
This produces the same output as the other approach on this example, but the results will differ if a date is skipped.
Output:
id score date_1 date_2 date_3 date_4 date_5
date
2014-01-01 A 75 0 0 0 0 0
2014-01-01 B 1 0 0 0 0 0
2014-01-01 C 2 0 0 0 0 0
2014-01-02 A 84 75 0 0 0 0
2014-01-02 B 1 1 0 0 0 0
2014-01-02 C 3 2 0 0 0 0
2014-01-03 A 1 84 75 0 0 0
2014-01-03 B 1 1 1 0 0 0
2014-01-03 C 1 3 2 0 0 0