
Repeated Sampling from groups in a dataframe and applying a function

This is a combination of two questions: "Repeat the re-sampling function for 1000 times? Using lapply?" and "How do you sample groups in a data.table with a caveat".

The goal is to sample groups in a data.table, but repeat this process "n" times and pull the average for each row-value. For example:

#generate the data
library(data.table)
DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))

#sample the data as done in the second linked question
DT[,.SD[sample(.N,min(.N,3))],by = a]
     a   b
 1:  1 288
 2:  1 881
 3:  1 409
 4:  2 937
 5:  3  46
 6:  4 525
 7:  5 887
 8:  6 548
 9:  7 453
10:  8 948
11:  9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15  42

Now here is my attempt using the answer given in the first-linked question:

x <- replicate(100,{DT[,.SD[sample(.N,min(.N,3))],by = a]})

This returns a list "x" with each repetition. The only way I can think of accessing the repetitions is by this:

# repetition 1 col-a values
x[[1]]
# repetition 1 col-b values
x[[2]]
# repetition 2 col-a values
x[[3]]
# repetition 2 col-b values
x[[4]]

So to get the average for each row, I would have to take the mean of x[[j]] for every j in seq(2,200,2), where 200 is the number of replications times 2.

Is there an easier way of doing this? I have tried using this solution ( https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r ) in this fashion:

y <- DT[,.SD[sample(.N,min(.N,3))],by = a]
y[,list(mean=mean(b)),by=a]
     a mean
 1:  1  550
 2:  2  849
 3:  3  603
 4:  4   77
 5:  5  973
 6:  6  746
 7:  7  919
 8:  8  655
 9:  9  883
10: 10  823
11: 11  533
12: 12  483
13: 13   53
14: 14  827
15: 15  413

But I have yet to be able to do this with the replication process. Any help would be great!

Something like this??

Based on your comments, you want means by group for each replicate, so in this example 15 * 100 means. Here are two ways to do that.

library(data.table)
set.seed(1) # for reproducibility
DT = data.table(a=c(1,1,1,1:15,1,1), b=sample(1:1000,20))
x <- replicate(100,{DT[,.SD[sample(.N,min(.N,3))],by = a]})

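# odd positions of x hold the a columns, even positions hold the b columns
indx <- seq(1,length(x),2)
# approach 1: pair each replicate's a and b columns and average b within each group
result.1 <- mapply(function(a,b)aggregate(b,list(a),mean)$x,x[indx],x[indx+1])
str(result.1)
#  num [1:15, 1:100] 569 201 894 940 657 625 62 204 175 679 ...
# approach 2: reuse the a column of the first replicate as the grouping for every replicate
result.2 <- sapply(x[indx+1],function(b)aggregate(b,x[1],mean)$x)
identical(result.1,result.2)
# [1] TRUE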

Both methods produce a 15 x 100 matrix of means, with the groups in rows and the replicates in columns. The second approach takes advantage of the fact that the a column is the same for all replicates.
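
If the odd/even indexing feels awkward, one possible alternative (a sketch, not part of the approaches above) is to keep each replicate as its own data.table and stack them with an id column, so the means can be computed with an ordinary grouped query. The names n_rep, samples, long, rep_id, means_long, means_wide and overall are made up for illustration, and this assumes 100 replicates as in the question.

```r
library(data.table)

set.seed(1)  # for reproducibility
DT <- data.table(a = c(1, 1, 1, 1:15, 1, 1), b = sample(1:1000, 20))

n_rep <- 100  # number of replicates (assumed, as in the question)

# simplify = FALSE keeps one data.table per replicate in a plain list
samples <- replicate(
  n_rep,
  DT[, .SD[sample(.N, min(.N, 3))], by = a],
  simplify = FALSE
)

# Stack the replicates; idcol adds a column numbering them 1..n_rep
long <- rbindlist(samples, idcol = "rep_id")

# Mean of b per group within each replicate (15 * 100 rows)
means_long <- long[, .(mean_b = mean(b)), by = .(rep_id, a)]

# Optional: wide layout with groups in rows and one column per replicate
means_wide <- dcast(means_long, a ~ rep_id, value.var = "mean_b")

# Average of the per-replicate means for each group
overall <- means_long[, .(mean_b = mean(mean_b)), by = a]
```

Note that in this example only group a = 1 has more than three rows, so it is the only group whose mean changes from one replicate to the next; every other group has a single row and returns the same value each time.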
