Converting date row to column for last N days

I want to build a time series prediction model using features such as week of the year, day of the week, season, etc.

Since the prediction will be strongly affected by the most recent values, I want to use the values from the last 5 days as features. However, I am having trouble preparing the data for learning:

My current table looks like this:

    date        id  score
0   2014-01-01  A   75
1   2014-01-01  B   1
2   2014-01-01  C   2
4   2014-01-02  A   84
5   2014-01-02  B   1
6   2014-01-02  C   3
8   2014-01-03  A   1
9   2014-01-03  B   1
10  2014-01-03  C   1

So I want each row to look like this:

    date        id  score  date_1 date_2 date_3 date_4 date_5
8   2014-01-03  A   1      84     75     0      0      0
9   2014-01-03  B   1      1      1      0      0      0

date_1 is the score of the same id on the day before the date in the 'date' column, date_2 is the score two days before, and so on.

That way I can predict the next day using the information from the last 5 days, plus more features that are irrelevant to this question. It is OK to fill NaN values with 0.

You can use groupby('id') and shift. Your df should be sorted by date first, for example with df = df.sort_values('date') (sort_values returns a new DataFrame, so keep the result), before using the following command:

for i in range(5):
    df['date_'+str(i+1)] = df.groupby('id')['score'].shift(i+1).fillna(0).astype(int)

Using the above command yields the following df:

[image: the resulting df, not reproduced here]
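To see the result as text, here is a minimal sketch that rebuilds the question's sample data and applies the same loop. Sorting by both 'date' and 'id' is an assumption made here only so the row order is deterministic. The answer itself only requires sorting by date.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01'] * 3 +
                           ['2014-01-02'] * 3 +
                           ['2014-01-03'] * 3),
    'id': ['A', 'B', 'C'] * 3,
    'score': [75, 1, 2, 84, 1, 3, 1, 1, 1],
})

# sort_values returns a new DataFrame, so the result is assigned back.
df = df.sort_values(['date', 'id'])

for i in range(5):
    # Within each id, take the score from i+1 rows earlier; gaps become 0.
    df['date_' + str(i + 1)] = (
        df.groupby('id')['score'].shift(i + 1).fillna(0).astype(int)
    )

# Rows for the last day, which are the ones the question asks about.
print(df[df['date'] == '2014-01-03'])
```

For id A on 2014-01-03 this gives date_1 = 84 and date_2 = 75, matching the row requested in the question.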

Time shifting using Timedelta

The other answer shifts by row position. That works in this instance, but it will break if there are gaps in the dates or if the rows have not been sorted by date.
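A small sketch of the failure mode, using made-up data for a single id where 2014-01-02 is missing (the frame name `gap` and its values are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical data: id A has no row for 2014-01-02.
gap = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-03']),
    'id': ['A', 'A'],
    'score': [75, 1],
})

# Positional shift: takes the previous row, whatever its date is.
gap['date_1'] = gap.groupby('id')['score'].shift(1).fillna(0).astype(int)

print(gap)
```

Here date_1 for 2014-01-03 comes out as 75, even though 75 was recorded two days earlier, not one.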

You can handle this by converting the DataFrame to a time series (indexed by date), then using the freq parameter of DataFrame.shift() with a pandas.Timedelta object.
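As a rough illustration of what freq does, here is a sketch on a single Series with a date gap (the data is made up for the example). With freq set, shift moves the index labels rather than the values, so aligning back on the original dates gives "the value exactly one day earlier":

```python
import pandas as pd

# Hypothetical series with no observation on 2014-01-03.
s = pd.Series(
    [75, 84, 1],
    index=pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-04']),
    name='score',
)

# The index moves forward by one day; each value stays attached to its label.
shifted = s.shift(freq=pd.Timedelta('1d'))

# Align back onto the original dates: a date gets a value only if there
# was an observation exactly one day before it, otherwise NaN.
prev_day = shifted.reindex(s.index)

print(prev_day)
```

Only 2014-01-02 gets a value (75). 2014-01-04 stays NaN because nothing was observed on 2014-01-03.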

Example data:

import pandas as pd
df = pd.DataFrame({'date': ['2014-01-01'] * 3 +
                           ['2014-01-02'] * 3 +
                           ['2014-01-03'] * 3,
                   'id': ['A', 'B', 'C'] * 3,
                   'score': [75, 1, 2, 84, 1, 3, 1, 1, 1]})
df.date = pd.to_datetime(df.date)
df.set_index('date', inplace=True)

Because of the IDs, we need a couple of loops to keep each id's series separate:

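for i in range(5):
    col = 'date_{}'.format(i+1)
    freq = pd.Timedelta('{}d'.format(i+1))
    for id_ in df.id.unique():
        mask = df.id == id_
        # move this id's dates forward by i+1 days; assignment aligns on the date index
        df.loc[mask, col] = df.loc[mask, 'score'].shift(freq=freq)
    # dates with no earlier observation are NaN; fill with 0 as the question allows
    df[col] = df[col].fillna(0).astype(int)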

This produces the same output as the other approach on this example, but the results will differ if a date is skipped.

Output:

           id  score  date_1  date_2  date_3  date_4  date_5
date                                                        
2014-01-01  A     75       0       0       0       0       0
2014-01-01  B      1       0       0       0       0       0
2014-01-01  C      2       0       0       0       0       0
2014-01-02  A     84      75       0       0       0       0
2014-01-02  B      1       1       0       0       0       0
2014-01-02  C      3       2       0       0       0       0
2014-01-03  A      1      84      75       0       0       0
2014-01-03  B      1       1       1       0       0       0
2014-01-03  C      1       3       2       0       0       0
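To check the behaviour on data with a skipped date, here is one possible variant. It is not the code from either answer but an alternative sketch that gets the same calendar-based lag with a left merge instead of the per-id loop. The sample frame and the names `out` and `lagged` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: id A has no row for 2014-01-02.
df = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-03', '2014-01-04']),
    'id': ['A', 'A', 'A'],
    'score': [75, 1, 9],
})

out = df.copy()
for i in range(1, 6):
    col = 'date_{}'.format(i)
    # Copy of the scores with dates moved forward by i days, so a row
    # dated D in `out` matches the score observed on D - i.
    lagged = df[['date', 'id', 'score']].rename(columns={'score': col})
    lagged['date'] = lagged['date'] + pd.Timedelta(days=i)
    out = out.merge(lagged, on=['date', 'id'], how='left')
    # No observation exactly i days earlier: fill with 0.
    out[col] = out[col].fillna(0).astype(int)

print(out)
```

On this data the row for 2014-01-03 gets date_1 = 0 and date_2 = 75, whereas a positional shift would have put 75 in date_1.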
