How to divide a data frame in R into groups with an equal number of records and split the data randomly and equally into two data frames
I have some data that contains about 30000 records. I want to divide the data into groups of 288 records. I then want to split the data into two separate data frames, test_data and train_data, where the first 4 groups are stored in train_data and the 5th group in test_data, either sequentially or randomly. By randomly I mean that any one day out of every 5 days is saved into test_data and the remaining 4 go into train_data.
How can this be achieved?
Sample data:
# timestamp var1 var2
--------------------------------------
1 01-01-2019 18:00:00 1.2 21
2 01-01-2019 18:05:00 2.3 32
3 01-01-2019 18:10:00 3.4 43
4 01-01-2019 18:15:00 4.5 54
5 01-01-2019 18:20:00 5.6 65
.
.
.
3000 .. - .. .. .. ..
Sample output:
# in case of sequential or contiguous division
train_data = (#1,#2,#3,#4 .... #1152,#1441,......,#2592,...)
test_data = (#1153,#1154,.....,#1440,.....,#2593,....)
# in case of random division, any one block of 288 contiguous records out of each bunch of 5 blocks goes into test_data and the other 4 x 288 records go into train_data.
Currently, I have this method of data splitting.
set.seed(100)
train <- sample(nrow(dataset1), 0.7 * nrow(dataset1), replace = FALSE)
TrainSet <- dataset1[train,]
#scale (TrainSet, center = TRUE, scale = TRUE)
ValidSet <- dataset1[-train,]
#scale (ValidSet, center = TRUE, scale = TRUE)
summary(TrainSet)
summary(ValidSet)
Here's one way of doing it:
# assume the number of rows is divisible by 288
num_days = nrow(dataset1)/288
# Each value (True or False) indicates whether the *day* is included or not
training.days.mask = sample(rep(c(TRUE,TRUE,TRUE,TRUE,FALSE), length.out=num_days))
# To index the actual values, repeat each mask 288 times
training.samples.mask = rep(training.days.mask, each=288)
# now use the mask to extract the data
training.samples = dataset1[training.samples.mask,]
testing.samples = dataset1[!training.samples.mask,]
The idea is to first perform sample on the day indices (not on the individual records). Then, repeat each mask value 288 times so that it covers all the records of a full day.
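The same per-day mask can also be built so that exactly one block out of every bunch of 5 goes to the test set, either always the 5th block (sequential) or a randomly chosen one. The following is only a sketch and not part of the answer above: the helper name split_by_blocks and its arguments are hypothetical, and it assumes the rows are already in time order and that the number of rows is a multiple of 288.

```r
# Hypothetical helper: split a data frame into train/test by whole blocks.
# Assumes rows are in time order and nrow(df) is a multiple of block_size.
split_by_blocks <- function(df, block_size = 288, bunch = 5, random = FALSE) {
  num_blocks <- nrow(df) / block_size
  stopifnot(num_blocks >= 1, num_blocks == floor(num_blocks))

  # Zero-based index of the bunch each block belongs to
  bunch_id <- (seq_len(num_blocks) - 1) %/% bunch
  # Position of each block inside its bunch (1..bunch)
  pos <- (seq_len(num_blocks) - 1) %% bunch + 1
  num_bunches <- max(bunch_id) + 1

  if (random) {
    # One randomly chosen position per bunch
    chosen <- sample.int(bunch, num_bunches, replace = TRUE)
  } else {
    # Sequential: always the last block of each bunch
    chosen <- rep(bunch, num_bunches)
  }
  # Note: an incomplete final bunch may end up without a test block
  test_block <- pos == chosen[bunch_id + 1]

  # Expand the per-block mask to a per-row mask
  test_row <- rep(test_block, each = block_size)
  list(train = df[!test_row, , drop = FALSE],
       test  = df[test_row, , drop = FALSE])
}

# Usage on made-up data: 10 blocks of 288 rows
set.seed(100)
toy <- data.frame(id = seq_len(2880), var1 = rnorm(2880))
seq_split <- split_by_blocks(toy)                 # 5th block of each bunch
rnd_split <- split_by_blocks(toy, random = TRUE)  # random block of each bunch
range(seq_split$test$id[1:288])  # first test block covers rows 1153 to 1440
```

Unlike shuffling one mask over all days, this keeps the 4:1 ratio inside every bunch of 5 days, which is closer to the sample output in the question.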
Does this accomplish what you want?
dat$day <- as.Date(dat$timestamp, format = "%d-%m-%Y") # Add the day for each observation
days <- unique(dat$day) # Get the days since it is the sampling unit
groups <- seq(1, 105, by=5) # First day of each 5-day group, assuming 30240 observations, 105 days
daystest <- groups + sample(5, length(groups), replace=TRUE) - 1 # One random day per group (offset 0 to 4)
datetest <- days[daystest] # Days in the test set
Testing <- dat[dat$day %in% datetest,] # Test data set
Training <- dat[!dat$day %in% datetest,] # Training data set
Testing is a data frame of the original data for testing and Training is a data frame of the original data for training. Since you did not include a reproducible sample of your data, I can't test it.
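One way to try the day-based split without the real data is to generate a synthetic stand-in. This is only a sketch under stated assumptions: the column names follow the question, the values are made up, the series is assumed to cover 105 complete calendar days starting at midnight, and the random offset runs from 0 to 4 so that every index stays within the 105 available days.

```r
# Synthetic stand-in for the data: 105 full days at 5-minute steps
# (column names follow the question; the values are made up)
set.seed(100)
stamps <- seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
              by = "5 min", length.out = 105 * 288)
dat <- data.frame(timestamp = format(stamps, "%d-%m-%Y %H:%M:%S"),
                  var1 = rnorm(length(stamps)),
                  var2 = rnorm(length(stamps)))

# The day is the sampling unit
dat$day <- as.Date(dat$timestamp, format = "%d-%m-%Y")
days <- unique(dat$day)

# First day of each 5-day group, plus a random offset of 0 to 4
groups <- seq(1, length(days), by = 5)
daystest <- groups + sample(5, length(groups), replace = TRUE) - 1
datetest <- days[daystest]

Testing  <- dat[dat$day %in% datetest, ]
Training <- dat[!dat$day %in% datetest, ]

nrow(Testing)   # 21 days * 288 records
nrow(Training)  # 84 days * 288 records
```

If the real series starts mid-day, as the 18:00 timestamps in the question suggest, calendar days will not line up with consecutive blocks of 288 rows, so the first and last days would be partial.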