How to split timestamped data into train and test sets

I have a timestamped dataset, shown below:

date        type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
1990-01-15, 'B', 1500

I am trying to split this data into train and test sets while keeping the order based on date within each type. With a split ratio of 0.8, the expected output is the following:

train_data:

date        type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400

test_data:

    date        type, price
    1990-01-09, 'A', 900
    1990-01-10, 'A', 1000
    1990-01-15, 'B', 1500

Is there a Pythonic way to do this?

transform & transform

# grouper
g = df.groupby("type", sort=False).type

# sample_nos numbers the rows 1..size within each group;
# group_sizes repeats the group's size on every row of that group
sample_nos  = g.transform("cumcount").add(1)
group_sizes = g.transform("size")

# belongs to training or not
train_mask = sample_nos <= 0.8 * group_sizes

# then choose so
train_data = df[train_mask].copy()
test_data  = df[~train_mask].copy()
train_data

          date type  price
0   1990-01-01  'A'    100
1   1990-01-02  'A'    200
2   1990-01-03  'A'    300
3   1990-01-04  'A'    400
4   1990-01-05  'A'    500
5   1990-01-06  'A'    600
6   1990-01-07  'A'    700
7   1990-01-08  'A'    800
10  1990-01-11  'B'   1100
11  1990-01-12  'B'   1200
12  1990-01-13  'B'   1300
13  1990-01-14  'B'   1400

and

test_data

          date type  price
8   1990-01-09  'A'    900
9   1990-01-10  'A'   1000
14  1990-01-15  'B'   1500
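For reference, the mask-based idea above can be packaged as a small self-contained script. This is only a sketch: the helper name `split_by_group` is hypothetical, the sample frame is rebuilt with `pd.date_range` instead of being typed out, and the rows are assumed to be sorted by date within each group before the split.

```python
import pandas as pd

def split_by_group(df, group_col, ratio):
    """Keep the first `ratio` share of each group's rows for training (hypothetical helper)."""
    grouped = df.groupby(group_col, sort=False)
    # 1-based position of every row inside its own group
    position = grouped.cumcount() + 1
    # size of the row's group, repeated on every row and aligned to df's index
    group_size = grouped[group_col].transform("size")
    train_mask = position <= ratio * group_size
    return df[train_mask].copy(), df[~train_mask].copy()

# Rebuild the question's sample: 10 rows of type A followed by 5 rows of type B
df = pd.DataFrame({
    "date": pd.date_range("1990-01-01", periods=15, freq="D"),
    "type": ["A"] * 10 + ["B"] * 5,
    "price": range(100, 1600, 100),
})

# Assumption: rows must be in date order within each type before splitting
df = df.sort_values(["type", "date"]).reset_index(drop=True)

train_data, test_data = split_by_group(df, "type", 0.8)
print(train_data)
print(test_data)
```

Because the mask is computed per row, the original index is preserved, so the test rows can be traced back to their positions in `df`.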

You can use the groupby and apply methods to split the data.

Code:

import io
import pandas as pd

# Create sample data as string
s = '''date,type,price
1990-01-01,A,100
1990-01-02,A,200
1990-01-03,A,300
1990-01-04,A,400
1990-01-05,A,500
1990-01-06,A,600
1990-01-07,A,700
1990-01-08,A,800
1990-01-09,A,900
1990-01-10,A,1000
1990-01-11,B,1100
1990-01-12,B,1200
1990-01-13,B,1300
1990-01-14,B,1400
1990-01-15,B,1500'''

# Read the sample
df = pd.read_csv(io.StringIO(s))

# Sort by type, then by date, so each group's rows are in chronological order
df = df.sort_values(['type', 'date']).reset_index(drop=True)

# Split df into train and test dataframes:
# take the first int(split_ratio * group size) rows of each type for training
split_ratio = 0.8
train_data = df.groupby('type', group_keys=False).apply(lambda g: g.head(int(split_ratio * len(g))))
test_data = df[~df.index.isin(train_data.index)]

Output:

# train_data:

          date type  price
0   1990-01-01    A    100
1   1990-01-02    A    200
2   1990-01-03    A    300
3   1990-01-04    A    400
4   1990-01-05    A    500
5   1990-01-06    A    600
6   1990-01-07    A    700
7   1990-01-08    A    800
10  1990-01-11    B   1100
11  1990-01-12    B   1200
12  1990-01-13    B   1300
13  1990-01-14    B   1400

# test_data:

          date type  price
8   1990-01-09    A    900
9   1990-01-10    A   1000
14  1990-01-15    B   1500
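Since the point of the split is to keep the date order, it may be worth checking that no test row precedes a training row of the same type. The following is a minimal sketch of such a check; the function name `is_chronological_split` and its default column names are assumptions for illustration, not part of either answer.

```python
import pandas as pd

def is_chronological_split(train, test, group_col="type", date_col="date"):
    """True if, within every group, all training dates precede all test dates."""
    last_train = train.groupby(group_col)[date_col].max()
    first_test = test.groupby(group_col)[date_col].min()
    # Line the two per-group series up side by side; groups that are
    # missing from either side are dropped before comparing
    joined = pd.concat(
        [last_train, first_test], axis=1, keys=["last_train", "first_test"]
    ).dropna()
    return bool((joined["last_train"] < joined["first_test"]).all())

# Same shape as the question's sample data
df = pd.DataFrame({
    "date": pd.date_range("1990-01-01", periods=15, freq="D"),
    "type": ["A"] * 10 + ["B"] * 5,
    "price": range(100, 1600, 100),
})

# The split from the expected output: the last rows of each type go to test
test_data = df.loc[[8, 9, 14]]
train_data = df.drop(test_data.index)
good = is_chronological_split(train_data, test_data)

# A deliberately wrong split: the earliest 'A' row ends up in the test set
bad_test = df.loc[[0, 9, 14]]
bad_train = df.drop(bad_test.index)
bad = is_chronological_split(bad_train, bad_test)

print(good, bad)
```

The check compares only the latest training date with the earliest test date per type, which is enough to detect leakage when dates within a type are unique.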
