How to split timestamped data into train and test sets
I have a timestamped dataset, shown below:
date, type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
1990-01-15, 'B', 1500
I am trying to split this data into train and test sets while keeping the order based on date. If the split ratio is 0.8, with the ratio applied within each type, the expected output is the following:
train_data:
date, type, price
1990-01-01, 'A', 100
1990-01-02, 'A', 200
1990-01-03, 'A', 300
1990-01-04, 'A', 400
1990-01-05, 'A', 500
1990-01-06, 'A', 600
1990-01-07, 'A', 700
1990-01-08, 'A', 800
1990-01-11, 'B', 1100
1990-01-12, 'B', 1200
1990-01-13, 'B', 1300
1990-01-14, 'B', 1400
test_data:
date, type, price
1990-01-09, 'A', 900
1990-01-10, 'A', 1000
1990-01-15, 'B', 1500
Is there a Pythonic way to do this?
transform & transform
# grouper
g = df.groupby("type", sort=False).type
# sample_nos runs 1..size within each group; group_sizes repeats the group's size on every row
sample_nos = g.transform("cumcount").add(1)
group_sizes = g.transform("size")
# belongs to training or not
train_mask = sample_nos <= 0.8 * group_sizes
# then choose so
train_data = df[train_mask].copy()
test_data = df[~train_mask].copy()
train_data
date type price
0 1990-01-01 'A' 100
1 1990-01-02 'A' 200
2 1990-01-03 'A' 300
3 1990-01-04 'A' 400
4 1990-01-05 'A' 500
5 1990-01-06 'A' 600
6 1990-01-07 'A' 700
7 1990-01-08 'A' 800
10 1990-01-11 'B' 1100
11 1990-01-12 'B' 1200
12 1990-01-13 'B' 1300
13 1990-01-14 'B' 1400
and
test_data
date type price
8 1990-01-09 'A' 900
9 1990-01-10 'A' 1000
14 1990-01-15 'B' 1500
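The snippet above assumes `df` already exists and is ordered by date within each type. A minimal self-contained sketch of the same mask-based idea follows, using the question's sample data. The helper name `split_by_group` and its parameters are hypothetical, and the explicit sort is an added assumption so that unsorted input is handled too.

```python
import io
import pandas as pd

CSV = """date,type,price
1990-01-01,A,100
1990-01-02,A,200
1990-01-03,A,300
1990-01-04,A,400
1990-01-05,A,500
1990-01-06,A,600
1990-01-07,A,700
1990-01-08,A,800
1990-01-09,A,900
1990-01-10,A,1000
1990-01-11,B,1100
1990-01-12,B,1200
1990-01-13,B,1300
1990-01-14,B,1400
1990-01-15,B,1500
"""

def split_by_group(df, group_col="type", date_col="date", ratio=0.8):
    """Hypothetical helper: chronological train/test split within each group."""
    # Assumption: sort first so "earliest rows" is well defined for any input order.
    df = df.sort_values([group_col, date_col])
    g = df.groupby(group_col, sort=False)
    sample_nos = g.cumcount() + 1                  # 1..size within each group
    group_sizes = g[group_col].transform("size")   # group size repeated on every row
    train_mask = sample_nos <= ratio * group_sizes
    return df[train_mask].copy(), df[~train_mask].copy()

df = pd.read_csv(io.StringIO(CSV), parse_dates=["date"])
train_data, test_data = split_by_group(df)
print(len(train_data), len(test_data))  # 12 3
```

With a ratio of 0.8, type A (10 rows) contributes its first 8 rows to training and type B (5 rows) its first 4, which matches the expected output in the question.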
You can use the groupby and apply methods to split the data.
import io
import pandas as pd
# Create sample data as string
s = '''date,type,price
1990-01-01,A,100
1990-01-02,A,200
1990-01-03,A,300
1990-01-04,A,400
1990-01-05,A,500
1990-01-06,A,600
1990-01-07,A,700
1990-01-08,A,800
1990-01-09,A,900
1990-01-10,A,1000
1990-01-11,B,1100
1990-01-12,B,1200
1990-01-13,B,1300
1990-01-14,B,1400
1990-01-15,B,1500'''
# Read the sample
df = pd.read_csv(io.StringIO(s))
# Sort by type, then date, so that head() below takes the earliest rows of each group
df = df.sort_values(['type', 'date']).reset_index(drop=True)
# Split df into train and test dataframes
split_ratio = 0.8
train_data = df.groupby('type', group_keys=False).apply(lambda g: g.head(int(split_ratio * len(g))))
test_data = df[~df.index.isin(train_data.index)]
train_data:

| | date | type | price |
|---|---|---|---|
| 0 | 1990-01-01 | A | 100 |
| 1 | 1990-01-02 | A | 200 |
| 2 | 1990-01-03 | A | 300 |
| 3 | 1990-01-04 | A | 400 |
| 4 | 1990-01-05 | A | 500 |
| 5 | 1990-01-06 | A | 600 |
| 6 | 1990-01-07 | A | 700 |
| 7 | 1990-01-08 | A | 800 |
| 10 | 1990-01-11 | B | 1100 |
| 11 | 1990-01-12 | B | 1200 |
| 12 | 1990-01-13 | B | 1300 |
| 13 | 1990-01-14 | B | 1400 |

test_data:

| | date | type | price |
|---|---|---|---|
| 8 | 1990-01-09 | A | 900 |
| 9 | 1990-01-10 | A | 1000 |
| 14 | 1990-01-15 | B | 1500 |
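One way to sanity-check a split like this is to confirm that, within each type, every training date comes before every test date. The sketch below is only an illustration: it rebuilds the sample data, derives the split with `cumcount` instead of `apply` (a swapped-in variant that truncates the cutoff the same way `int()` does), and uses a hypothetical helper named `check_chronological`.

```python
import pandas as pd

# Same sample data as in the question, built directly as a DataFrame.
df = pd.DataFrame({
    "date": pd.date_range("1990-01-01", periods=15, freq="D"),
    "type": ["A"] * 10 + ["B"] * 5,
    "price": range(100, 1600, 100),
})

split_ratio = 0.8
df = df.sort_values(["type", "date"]).reset_index(drop=True)

# Number of leading rows per group that go to train (truncated, like int()).
sizes = df.groupby("type")["type"].transform("size")
cutoff = (split_ratio * sizes).astype(int)
position = df.groupby("type").cumcount()  # 0-based position within each group

train_data = df[position < cutoff]
test_data = df[position >= cutoff]

def check_chronological(train, test, group_col="type", date_col="date"):
    """Hypothetical check: last train date precedes first test date per group."""
    last_train = train.groupby(group_col)[date_col].max()
    first_test = test.groupby(group_col)[date_col].min()
    # Align on group labels; groups missing from either side are dropped.
    both = pd.concat(
        [last_train, first_test], axis=1, keys=["last_train", "first_test"]
    ).dropna()
    return bool((both["last_train"] < both["first_test"]).all())

print(check_chronological(train_data, test_data))  # True
```

If the check returns False, the input was probably not sorted by date within each group before splitting.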