How to divide a data frame in R into groups of an equal number of records and randomly split the data between two data frames

Question

I have some data that contains about 30000 records. I want to divide the data into groups of 288 records (one day each), and then sort the data into separate test_data and train_data data frames, where the first 4 groups are stored in train_data and the 5th group in test_data, either sequentially or randomly. By randomly I mean that any one of the 5 days is saved into test_data and the remaining 4 go into train_data.

How can this be achieved?

Sample data:

 #   timestamp               var1      var2
    --------------------------------------
 1   01-01-2019 18:00:00      1.2       21
 2   01-01-2019 18:05:00      2.3       32
 3   01-01-2019 18:10:00      3.4       43
 4   01-01-2019 18:15:00      4.5       54
 5   01-01-2019 18:20:00      5.6       65
 . 
 .
 .
3000  ..   -    ..   ..        ..        ..

Sample output:

# In case of sequential or contiguous division
train_data = (#1,#2,#3,#4 .... #1152,#1441,......,#2592,...) 
test_data = (#1153,#1154,.....,#1440,.....,#2593,....)

# In case of random division, any one block of 288 contiguous records out of each bunch of 5 blocks goes into test_data and the other 4 x 288 records go into train_data.

Currently, I have this method of data splitting:

    set.seed(100)

    train <- sample(nrow(dataset1), 0.7 * nrow(dataset1), replace = FALSE)
    TrainSet <- dataset1[train,]
    #scale (TrainSet, center = TRUE, scale = TRUE)
    ValidSet <- dataset1[-train,]
    #scale (ValidSet, center = TRUE, scale = TRUE)
    summary(TrainSet)
    summary(ValidSet)

Answer 1

Here's one way of doing it:

# Assume the number of rows is divisible by 288 (one day = 288 records)
num_days <- nrow(dataset1) / 288

# One logical value per *day*: TRUE = training day, FALSE = test day
training.days.mask <- sample(rep(c(TRUE, TRUE, TRUE, TRUE, FALSE), length.out = num_days))

# To index the actual rows, repeat each day's mask value 288 times
training.samples.mask <- rep(training.days.mask, each = 288)

# Now use the mask to extract the data
training.samples <- dataset1[training.samples.mask, ]
testing.samples <- dataset1[!training.samples.mask, ]

The idea is to first apply sample() to the day indices (not to the individual records). Then, repeat each day's mask value 288 times so that it covers all the records of a full day.
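
Shuffling the whole mask fixes the overall share of test days at about one in five, but it does not guarantee exactly one test day inside each consecutive bunch of 5 days. A minimal sketch of a per-bunch variant follows. It assumes the row count is a multiple of 5 * 288; the helper split_by_bunch and the synthetic dataset1 are hypothetical and not part of the original post.

```r
# Hypothetical helper: pick one test day per bunch of `days_per_bunch` days.
# random = FALSE always picks the last day of each bunch (the sequential case).
split_by_bunch <- function(df, day_len = 288, days_per_bunch = 5, random = TRUE) {
  num_days <- nrow(df) %/% day_len
  num_bunches <- num_days %/% days_per_bunch
  # Assumption: the data consists of whole bunches only
  stopifnot(nrow(df) == num_bunches * days_per_bunch * day_len)

  # Position (1..days_per_bunch) of the test day inside each bunch
  pick <- if (random) {
    sample(days_per_bunch, num_bunches, replace = TRUE)
  } else {
    rep(days_per_bunch, num_bunches)
  }

  # Day-level mask: TRUE = training day, FALSE = test day
  test_days <- (seq_len(num_bunches) - 1) * days_per_bunch + pick
  train_day_mask <- !(seq_len(num_days) %in% test_days)

  # Expand the day-level mask to row level
  train_row_mask <- rep(train_day_mask, each = day_len)
  list(train = df[train_row_mask, , drop = FALSE],
       test  = df[!train_row_mask, , drop = FALSE])
}

# Synthetic stand-in for dataset1: 10 days of 288 records each
set.seed(100)
dataset1 <- data.frame(id = seq_len(10 * 288), var1 = rnorm(10 * 288))

seq_split  <- split_by_bunch(dataset1, random = FALSE)  # days 5 and 10 go to test
rand_split <- split_by_bunch(dataset1, random = TRUE)   # one random day per bunch
```

With random = FALSE the first test block covers records 1153 to 1440, which matches the sequential layout described in the question.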

Answer 2

Does this accomplish what you want?

dat$day <- as.Date(dat$timestamp, "%d-%m-%Y")  # Add the day for each observation
days <- unique(dat$day)                        # Get the days since they are the sampling unit
groups <- seq(0, 104, by=5)                    # Offset of each 5-day group; assuming 30240 observations, 105 days
daystest <- sample(5, length(groups), replace=TRUE) + groups  # One day index (1..105) per group
datetest <- days[daystest]                     # Days in the test set
Testing <- dat[dat$day %in% datetest,]         # Test data set
Training <- dat[!dat$day %in% datetest,]       # Training data set

Testing is a data frame holding the test portion of the original data, and Training is a data frame holding the training portion. Since you did not include a reproducible sample of your data, I can't test it.
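
One way to try the date-based approach without the real data is a small synthetic data frame. The sketch below assumes 5-minute timestamps starting at midnight in UTC, so every calendar day holds exactly 288 records; all values are made up for illustration. The group offsets start at 0 so that the sampled day indices stay within 1..length(days).

```r
# Synthetic data: 10 full days at 5-minute intervals (288 records per day)
set.seed(100)
n <- 10 * 288
ts <- seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
          by = "5 min", length.out = n)
dat <- data.frame(
  timestamp = format(ts, "%d-%m-%Y %H:%M:%S"),
  var1 = round(runif(n, 1, 6), 1),
  var2 = sample(20:70, n, replace = TRUE)
)

dat$day <- as.Date(dat$timestamp, "%d-%m-%Y")    # Day of each observation
days <- unique(dat$day)                          # Sampling units
groups <- seq(0, length(days) - 1, by = 5)       # Offset of each 5-day group
daystest <- groups + sample(5, length(groups), replace = TRUE)
datetest <- days[daystest]                       # One test day per group

Testing  <- dat[dat$day %in% datetest, ]
Training <- dat[!dat$day %in% datetest, ]

nrow(Testing)   # 2 test days of 288 records each
nrow(Training)  # 8 training days of 288 records each
```

If the real timestamps do not start at midnight, calendar days and 288-record blocks no longer line up, and an index-based day label such as (seq_len(nrow(dat)) - 1) %/% 288 may be a better grouping key.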

Answer 1
0 2019-11-15 19:16:33

Answer 2
0 2019-11-15 19:18:17
