Creating sliding window groups from pandas dataframe

I'm trying to pre-process my data for an ML regression problem.
With the following (simplified) data frame:

   grp day  score
0    A   1      2
1    A   1      4
2    A   2      6
3    A   2      8
4    A   3     10
5    A   3     12
6    A   4     14
7    A   4     16
8    A   5     18
9    A   5     20
10   B   1      2
11   B   2      4
12   B   3      8
13   B   4     16
14   B   5     32

I'm trying to create a list of 'sliding window' sequences based on the day column, so if I have X days, each window of 2 consecutive days gets as its target the score Y days ahead.

In the example below I have 5 days in each group, and for each window of 2 days I'm looking at the target 2 days ahead, stopping when I've reached the end of the data frame.

So for example, here are the first 2 windows for group A:

   grp day  score   target
0    A   1      2    16
1    A   1      4    16
2    A   2      6    16
3    A   2      8    16 <- last score value of day 4 (group A)

   grp day  score   target
2    A   2      6    20
3    A   2      8    20
4    A   3      10   20
5    A   3      12   20 <- last score value of day 5 (group A)

And for group B:

   grp day  score   target
10   B   1      2    16
11   B   2      4    16 <- last score value of day 4 (group B)

   grp day  score   target
11   B   2      4    32
12   B   3      8    32 <- last score value of day 5 (group B)

I've used factorize to get the day index and group like so:

groups = df.groupby(['grp'])
for _,grp in groups:
  days_row_index = grp['day'].factorize()[0]
  days_group = grp.groupby(days_row_index)
  ...

But I'm a bit lost... any help would be appreciated.

Update:

I've written the following clumsy code to get me going... how can I improve it?

import pandas as pd
df = pd.DataFrame({'grp':['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B'],
                   'day':['1','1','2','2','3','3','4','4','5','5','1','2','3','4','5'],
                   'score':[2,4,6,8,10,12,14,16,18,20,2,4,8,16,32]
                   })

print(df.head(15))

df2 = pd.DataFrame({'grp':[],
                    'day':[],
                    'score':[]})

groups = df.groupby(['grp'])
GROUP_SIZE = 2
LOOK_AHEAD = 2
sequences = []

for _,grp in groups:
  days_row_index = grp['day'].factorize()[0]
  days_group = grp.groupby(days_row_index)
  for _,day in days_group:
    day_index = int(day['day'].values[0])
    if day_index + LOOK_AHEAD < len(days_group):
      target = days_group.get_group(day_index + LOOK_AHEAD)['score'].values[-1]
      print(day_index,day_index + LOOK_AHEAD,day['score'].values[-1],"----------->",target)
      day['target'] = target
      df2 = pd.concat([df2,day])
      for i in range(0, GROUP_SIZE-1):
        if day_index + i >= len(days_group):
          break
        next_day = days_group.get_group(day_index + i)
        next_day['target'] = target
        df2 = pd.concat([df2,next_day])
      sequences.append(df2.copy())
      df2 = df2.iloc[0:0]
sequences

Building on your proposed solution, I wrote this small piece of code. I'm sure it can be optimized, so anyone should feel free to improve it. Let me know if this is what you were looking for (I took the liberty of creating another "mixed" group 'C' to test a more general approach).

import pandas as pd

# Create test dataframe
df = [
     ['A', 1, 2],
     ['A', 1, 4],
     ['A', 2, 6],
     ['A', 2, 8],
     ['A', 3, 10],
     ['A', 3, 12],
     ['A', 4, 14],
     ['A', 4, 16],
     ['A', 5, 18],
     ['A', 5, 20],
     ['B', 1, 2],
     ['B', 2, 4],
     ['B', 3, 8],
     ['B', 4, 16],
     ['B', 5, 32],
     ['C', 1, 2],
     ['C', 1, 4],
     ['C', 2, 8],
     ['C', 3, 16],
     ['C', 3, 20],
     ['C', 4, 24],
     ['C', 5, 28]
     ]
df = pd.DataFrame(df, columns = ['grp', 'day', 'score'])

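# Processing
groups = df.groupby('grp')
for _, grp in groups:
    # Map each distinct day to its 0-based position within the group
    days_row_index = grp['day'].factorize()[0]
    i = days_row_index.min()
    # Stop once the target day (i + 3) would fall past the last day
    while i < days_row_index.max() - 2:
        # Rows belonging to the two-day window starting at position i
        idx = (days_row_index == i) | (days_row_index == i + 1)
        # Create list of targets for every subgroup: the last score of
        # the day two days after the window ends, repeated once per row
        target = grp['score'].values[days_row_index == i + 3][-1]
        print([target] * idx.sum())
        i += 1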
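
The loop above only prints the targets. If you also want the window rows themselves with a target column attached, as in the expected output in the question, something along these lines might work. This is only a sketch: the function name sliding_windows and the names group_size, look_ahead and day_pos are made up for illustration, and it assumes the days inside each group are already sorted and have no gaps, since factorize numbers days by order of first appearance.

```python
import pandas as pd

GROUP_SIZE = 2   # number of consecutive days in each window
LOOK_AHEAD = 2   # days between the window's last day and the target day


def sliding_windows(df, group_size=GROUP_SIZE, look_ahead=LOOK_AHEAD):
    """Return a list of DataFrames, one per window, each with a 'target' column.

    Hypothetical helper; assumes days are sorted and gap-free within each group.
    """
    windows = []
    for _, grp in df.groupby('grp', sort=False):
        # 0-based position of each row's day within the group
        day_pos = grp['day'].factorize()[0]
        n_days = day_pos.max() + 1
        # Only start windows whose target day still exists in the group
        for start in range(n_days - group_size - look_ahead + 1):
            last_day = start + group_size - 1
            target_day = last_day + look_ahead
            in_window = (day_pos >= start) & (day_pos <= last_day)
            # Target is the last score recorded on the target day
            target = grp.loc[day_pos == target_day, 'score'].iloc[-1]
            windows.append(grp[in_window].assign(target=target))
    return windows


# Data frame from the question (groups A and B only)
df = pd.DataFrame({
    'grp': list('AAAAAAAAAA') + list('BBBBB'),
    'day': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 2, 3, 4, 5],
    'score': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 2, 4, 8, 16, 32],
})

windows = sliding_windows(df)
for w in windows:
    print(w, end='\n\n')
```

Each window keeps the original row index, and assign returns a copy, so the source data frame is not modified. Groups with fewer than group_size + look_ahead days simply produce no windows.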
