How to divide a data frame in R into groups of an equal number of records and split the data randomly and equally into two data frames

I have some data that contains about 30000 records. I want to divide the data into groups of 288 records (one day each), and then split the data into two separate data frames, test_data and train_data, so that out of every 5 groups the first 4 are stored in train_data and the 5th in test_data. I would like to do this either sequentially or randomly. Randomly means that any one day of the 5 days is saved into test_data and the remaining 4 go into train_data.

How can this be achieved?

Sample data:

 #   timestamp               var1      var2
    --------------------------------------
 1   01-01-2019 18:00:00      1.2       21
 2   01-01-2019 18:05:00      2.3       32
 3   01-01-2019 18:10:00      3.4       43
 4   01-01-2019 18:15:00      4.5       54
 5   01-01-2019 18:20:00      5.6       65
 . 
 .
 .
3000  ..   -    ..   ..        ..        ..

Sample output:

#in case of sequential or contiguous division
train_data = (#1,#2,#3,#4 .... #1152,#1441,......,#2592,...) 
test_data = (#253,#254,.....,#1440,.....,#2593,....)

#in case of random division, any 288 contiguous records (one day) from each bunch of 5 days go into test_data and the other 4x288 into train_data.

Currently, I have this method of data splitting:

    set.seed(100)

    train <- sample(nrow(dataset1), 0.7 * nrow(dataset1), replace = FALSE)
    TrainSet <- dataset1[train,]
    #scale (TrainSet, center = TRUE, scale = TRUE)
    ValidSet <- dataset1[-train,]
    #scale (ValidSet, center = TRUE, scale = TRUE)
    summary(TrainSet)
    summary(ValidSet)

Here's one way of doing it:

# Assume the number of rows is divisible by 288 (one day of records)
num_days <- nrow(dataset1) / 288

# Each value (TRUE or FALSE) indicates whether the *day* goes into the training set
training.days.mask <- sample(rep(c(TRUE, TRUE, TRUE, TRUE, FALSE), length.out = num_days))

# To index the actual rows, repeat each day's mask value 288 times
training.samples.mask <- rep(training.days.mask, each = 288)

# Now use the mask to extract the data
training.samples <- dataset1[training.samples.mask, ]
testing.samples <- dataset1[!training.samples.mask, ]

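The idea is to first apply sample to the day indices (not to the individual records). Then repeat each mask value 288 times so that it covers every record of a full day. Note that the shuffle is over all days at once, so about one day in five ends up in the test set, but a given block of five consecutive days is not guaranteed to contribute exactly one test day.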
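If exactly one test day per block of five is required, as the question describes, a variation on the same idea could look like the sketch below. It is only an illustration on synthetic data: the data frame contents, `records_per_day`, `day_id`, and `days_by_block` are names made up for this example, and the number of rows is assumed to be a multiple of 288.

```r
set.seed(100)
records_per_day <- 288
num_days <- 10                       # assumed; in practice nrow(dataset1) / 288
n <- records_per_day * num_days

# Synthetic stand-in for dataset1 (hypothetical values)
dataset1 <- data.frame(record = seq_len(n), var1 = rnorm(n), var2 = rnorm(n))

# Day index of every row: rows 1-288 -> day 1, rows 289-576 -> day 2, ...
day_id <- (seq_len(n) - 1) %/% records_per_day + 1

# Group the day indices into consecutive blocks of five days
days_by_block <- split(seq_len(num_days), (seq_len(num_days) - 1) %/% 5)

# Sequential variant: the 5th day of every block is the test day
test_days_seq <- which(seq_len(num_days) %% 5 == 0)

# Random variant: draw one day from each block
# (indexing with sample.int avoids the sample(x, 1) pitfall for length-1 x)
test_days_rand <- vapply(days_by_block,
                         function(d) d[sample.int(length(d), 1)],
                         integer(1))

# Build the row mask from the chosen days and split the data
test_mask  <- day_id %in% test_days_rand
train_data <- dataset1[!test_mask, ]
test_data  <- dataset1[test_mask, ]
```

Swapping `test_days_rand` for `test_days_seq` in the mask gives the sequential split. If the number of days is not a multiple of five, the last block is shorter and still contributes one test day.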

Does this accomplish what you want?

dat$day <- as.Date(dat$timestamp, "%d-%m-%Y")  # Add the day for each observation
days <- unique(dat$day)                        # Get the days since it is the sampling unit
groups <- seq(1, 105, by=5)                    # First day of each group; assuming 30240 observations, 105 days
daystest <- sample(0:4, length(groups), replace=TRUE) + groups  # One random day within each group of 5
datetest <- days[daystest]                     # Days in the test set
Testing <- dat[dat$day %in% datetest,]         # Test data set
Training <- dat[!dat$day %in% datetest,]       # Training data set

Testing is a data frame holding the rows of the original data used for testing, and Training is a data frame holding the rows used for training. Since you did not include a reproducible sample of your data, I can't test it.
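One thing to check with a date-based approach: the sample data starts at 18:00, so calendar dates may not line up with the 288-record groups. The sketch below, on made-up data, shows the mismatch and one possible workaround (shifting the timestamps by the start time of day before taking the date). The 18-hour offset, the 5-minute spacing, and the names `calendar_day` and `shifted_day` are assumptions for illustration.

```r
# Hypothetical data: 5-minute records starting 01-01-2019 18:00:00, 10 groups of 288
n <- 288 * 10
start <- as.POSIXct("01-01-2019 18:00:00", format = "%d-%m-%Y %H:%M:%S", tz = "UTC")
dat <- data.frame(timestamp = start + 300 * (seq_len(n) - 1), var1 = seq_len(n))

# Calendar dates cut the series at midnight, so the first date is a partial day
calendar_day <- as.Date(dat$timestamp, tz = "UTC")
head(table(calendar_day), 3)

# Shifting by 18 hours makes each "day" run from 18:00 to 17:55,
# which matches the 288-record groups in this synthetic series
shifted_day <- as.Date(dat$timestamp - 18 * 3600, tz = "UTC")
table(shifted_day)
```

With real data that has gaps or a different start time, indexing days by row position or by a shifted date will give different groupings, so it is worth tabulating the group sizes before splitting.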
