How to split timestamped data into train and test sets

Question

I have a timestamped dataset, shown below:

date,       type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
1990-01-15, 'B', 1500

I am trying to split this data into train and test sets while keeping the order based on date. If the split ratio is 0.8 (the train share), the expected output is the following:

train_data:

date,       type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400

test_data:

date,       type, price
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-15, 'B', 1500

Is there a Pythonic way to do this?

Answer 1

transform & transform

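# grouper: group by type, keeping the groups in their original order
# (df is assumed to be sorted by date within each type already)
g = df.groupby("type", sort=False).type

# sample_nos:  1-based position of each row within its group (1..size)
# group_sizes: size of the row's group, repeated for every row
sample_nos  = g.transform("cumcount").add(1)
group_sizes = g.transform("size")

# a row belongs to training if it falls within the first 80% of its group
train_mask = sample_nos <= 0.8 * group_sizes

# then select rows with the mask and its negation
train_data = df[train_mask].copy()
test_data  = df[~train_mask].copy()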

train_data

          date type  price
0   1990-01-01  'A'    100
1   1990-01-02  'A'    200
2   1990-01-03  'A'    300
3   1990-01-04  'A'    400
4   1990-01-05  'A'    500
5   1990-01-06  'A'    600
6   1990-01-07  'A'    700
7   1990-01-08  'A'    800
10  1990-01-11  'B'   1100
11  1990-01-12  'B'   1200
12  1990-01-13  'B'   1300
13  1990-01-14  'B'   1400

and

test_data

          date type  price
8   1990-01-09  'A'    900
9   1990-01-10  'A'   1000
14  1990-01-15  'B'   1500
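
Answer 1's snippet assumes that df already exists and is sorted by date within each type. The following is a minimal, self-contained sketch of the same position-versus-group-size idea. The helper name split_by_position and its parameters are hypothetical (not from the answer), and the sample frame is rebuilt here without the quote characters around the type values.

```python
import pandas as pd

# Rebuild the question's sample data: 10 rows of type A, then 5 rows of type B
df = pd.DataFrame({
    "date": pd.date_range("1990-01-01", periods=15, freq="D"),
    "type": ["A"] * 10 + ["B"] * 5,
    "price": list(range(100, 1600, 100)),
})

def split_by_position(df, ratio=0.8, key="type", time_col="date"):
    """Hypothetical helper: the earliest `ratio` share of each group goes to train."""
    # Sort so that the position within a group reflects chronological order
    df = df.sort_values([key, time_col])
    g = df.groupby(key, sort=False)
    position = g.cumcount() + 1        # 1..size within each group
    size = g[key].transform("size")    # group size, repeated for every row
    mask = position <= ratio * size
    return df[mask].copy(), df[~mask].copy()

train_data, test_data = split_by_position(df)
print(train_data.index.tolist())
print(test_data.index.tolist())
```

With the sample data this selects the same rows as the output shown above (index 0 to 7 and 10 to 13 for training, 8, 9 and 14 for testing).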

Answer 2

You can use the groupby and apply methods to split the data.

Code:

import io
import pandas as pd

# Create sample data as string
s = '''date,type,price
1990-01-01,A,100
1990-01-02,A,200
1990-01-03,A,300
1990-01-04,A,400
1990-01-05,A,500
1990-01-06,A,600
1990-01-07,A,700
1990-01-08,A,800
1990-01-09,A,900
1990-01-10,A,1000
1990-01-11,B,1100
1990-01-12,B,1200
1990-01-13,B,1300
1990-01-14,B,1400
1990-01-15,B,1500'''

# Read the sample
df = pd.read_csv(io.StringIO(s))

# Sort by type, then by date, so that each group is in chronological order
df = df.sort_values(['type', 'date']).reset_index(drop=True)

# Split df into train and test dataframes:
# the first split_ratio share of each type goes to train, the rest to test
split_ratio = 0.8
train_data = df.groupby('type', group_keys=False).apply(lambda g: g.head(int(split_ratio * len(g))))
test_data = df[~df.index.isin(train_data.index)]

Output:

# train_data:

          date type  price
0   1990-01-01    A    100
1   1990-01-02    A    200
2   1990-01-03    A    300
3   1990-01-04    A    400
4   1990-01-05    A    500
5   1990-01-06    A    600
6   1990-01-07    A    700
7   1990-01-08    A    800
10  1990-01-11    B   1100
11  1990-01-12    B   1200
12  1990-01-13    B   1300
13  1990-01-14    B   1400

# test_data:

          date type  price
8   1990-01-09    A    900
9   1990-01-10    A   1000
14  1990-01-15    B   1500
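
As a possible variant of Answer 2, the sketch below takes the head of each group with a plain loop over the groups instead of groupby.apply, then checks that the split really preserves date order within each type. The loop is a swapped-in technique, not the answer's own method; the motivation is that, as far as I know, newer pandas releases warn when apply operates on the grouping column. The variable names train_parts and order_ok are illustrative.

```python
import pandas as pd

# Rebuild the sample data: 10 rows of type A, then 5 rows of type B
df = pd.DataFrame({
    "date": pd.date_range("1990-01-01", periods=15, freq="D"),
    "type": ["A"] * 10 + ["B"] * 5,
    "price": list(range(100, 1600, 100)),
})
df = df.sort_values(["type", "date"]).reset_index(drop=True)

split_ratio = 0.8

# Take the first int(split_ratio * n) rows of each group, without apply
train_parts = [grp.head(int(split_ratio * len(grp))) for _, grp in df.groupby("type")]
train_data = pd.concat(train_parts)
test_data = df.drop(train_data.index)

# Sanity check: within each type, the latest training date must come
# before the earliest test date (assumes every type has rows in both sets)
last_train = train_data.groupby("type")["date"].max()
first_test = test_data.groupby("type")["date"].min()
order_ok = bool((last_train < first_test).all())
print(order_ok)
```

Because int() truncates, a group with very few rows can end up with no training rows at all (for example, a single-row group with a ratio of 0.8), so it may be worth checking group sizes before relying on this split.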
