
How can I split train and test data based on some conditions?

How can I split data into training and test sets for machine learning models based on a condition? The test set should contain the same spatial areas (x, y) for every year, and no spatial area should appear in both the training set and the test set. For example:

import pandas as pd
data = {'x': [ 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1], 'y': [ 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1], 'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16], 'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0], 'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}   
df = pd.DataFrame(data)
df
            
        x      y   a    c  year
0    80.1  140.1   1  0.0  2000
1    90.1  150.1   2  0.0  2000
2     0.0  160.1   3  0.0  2000
3   300.1  400.1   4  0.0  2000
4    80.1  140.1   5  0.0  2001
5    90.1  150.1  10  0.0  2001
6     0.0  160.1  11  1.0  2001
7   300.1  400.1  12  0.0  2001
8    80.1  140.1  13  1.0  2002
9    90.1  150.1  14  1.0  2002
10    0.0  160.1  15  0.0  2002
11  300.1  400.1  16  0.0  2002

Expected train dataset:

        x      y   a    c  year
0    80.1  140.1   1  0.0  2000
1    90.1  150.1   2  0.0  2000
3   300.1  400.1   4  0.0  2000
4    80.1  140.1   5  0.0  2001
5    90.1  150.1  10  0.0  2001
7   300.1  400.1  12  0.0  2001
8    80.1  140.1  13  1.0  2002
9    90.1  150.1  14  1.0  2002
11  300.1  400.1  16  0.0  2002

Expected test dataset:

        x      y   a    c  year
2     0.0  160.1   3  0.0  2000
6     0.0  160.1  11  1.0  2001
10    0.0  160.1  15  0.0  2002
              
import pandas as pd

data = {'x': [ 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1], 'y': [ 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1], 'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16], 'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0], 'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}
df1 = pd.DataFrame(data)

# Rows 2, 6 and 10 hold the location (0, 160.1), one row per year.
# This relies on every year listing the locations in the same order.
test_data = df1.loc[range(2, 12, 4)]

# Everything else is training data; dropping by index keeps the original dtypes.
training_data = df1.drop(test_data.index)
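
The position-based selection above only works while the rows stay in a fixed order. As a rough sketch of a more general variant, assuming the goal is to hold out whole (x, y) locations across all years, each distinct coordinate pair can be given an id and a subset of ids sent to the test set. The names `loc_id`, `test_locs` and `is_test`, the fixed seed, and the choice of holding out exactly one location are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

data = {'x': [80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1],
        'y': [140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1],
        'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16],
        'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
        'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}
df = pd.DataFrame(data)

# Give every distinct (x, y) pair an integer id, in order of first appearance.
loc_id = df.groupby(['x', 'y'], sort=False).ngroup()

# Assumption: hold out one randomly chosen location; the seed is arbitrary.
rng = np.random.default_rng(0)
test_locs = rng.choice(loc_id.unique(), size=1, replace=False)

# A row goes to the test set if its location was selected, whatever its year.
is_test = loc_id.isin(test_locs)
test_data = df[is_test]
training_data = df[~is_test]
```

Because the split is made on location ids rather than on rows, a location can never end up on both sides, and each held-out location contributes one row per year it appears in.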

You can build a grouping key by starting a new group at every row where the x column equals 0:

m = df.loc[::-1, 'x'].eq(0).cumsum()[::-1]
print(m)

0     3
1     3
2     3
3     2
4     2
5     2
6     2
7     1
8     1
9     1
10    1
11    0
Name: x, dtype: int64
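
To make the reversed cumulative sum less opaque, here is a small sketch that breaks the one-liner into steps. The shortened series `x` and the intermediate names `is_zero` and `reversed_count` are made up for illustration.

```python
import pandas as pd

# A shortened, made-up x column with zeros at positions 2 and 6.
x = pd.Series([80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1])

# Step 1: mark the rows where x equals 0.
is_zero = x.eq(0)

# Step 2: walk from the last row to the first, counting zeros seen so far.
reversed_count = is_zero[::-1].cumsum()

# Step 3: flip back to the original row order.
m = reversed_count[::-1]

print(m.tolist())  # [2, 2, 2, 1, 1, 1, 1, 0]
```

Each zero row closes a group, so the label changes on the row right after it.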

Then group by this key and filter each group:

# Training rows: within each group, keep the rows where x is not 0.
df_train = df.groupby(m).apply(lambda group: group[group['x'].ne(0)])
print(df_train)

          x      y   a  c  year
x
0 11  300.1  400.1  16  0  2002
1 7   300.1  400.1  12  0  2001
  8    80.1  140.1  13  1  2002
  9    90.1  150.1  14  1  2002
2 3   300.1  400.1   4  0  2000
  4    80.1  140.1   5  0  2001
  5    90.1  150.1  10  0  2001
3 0    80.1  140.1   1  0  2000
  1    90.1  150.1   2  0  2000

# Test rows: within each group, keep the rows where x is 0.
df_test = df.groupby(m).apply(lambda group: group[group['x'].eq(0)])
print(df_test)

        x      y   a  c  year
x
1 10  0.0  160.1  15  0  2002
2 6   0.0  160.1  11  1  2001
3 2   0.0  160.1   3  0  2000
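
If the extra group level in the index is not needed, a plain boolean mask should select the same rows while keeping the original flat index. This is a sketch that assumes, as the answer above does, that x equal to 0 identifies the held-out location. The name `is_test` is introduced here for illustration.

```python
import pandas as pd

data = {'x': [80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1, 80.1, 90.1, 0, 300.1],
        'y': [140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1, 140.1, 150.1, 160.1, 400.1],
        'a': [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15, 16],
        'c': [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0],
        'year': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002]}
df = pd.DataFrame(data)

# True for every row that belongs to the held-out location (x == 0).
is_test = df['x'].eq(0)

df_test = df[is_test]     # rows 2, 6, 10
df_train = df[~is_test]   # the remaining nine rows

print(df_test.index.tolist())  # [2, 6, 10]
```

The rows come back in their original order, which matches the expected train and test tables in the question.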

