
Random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.

I have the following data.table with n columns belonging to m groups of unequal size. Here is an example of my data.table:

dframe   <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters


           A           A          A           A           A          A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069  0.1165187
2 -1.5891905 -0.44468389 -0.1186977  0.02270782 -0.64950716 -0.6844163
          A         A          A          A         B         B          B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272  0.8458673
2 -1.644389 0.6360258  0.5612634  0.3559574 1.9658743  1.858222 -1.4502839
           B          B          B         B          B           B          B
1  0.3167216 -0.2919079  0.5146733 0.6628149  0.5481958 -0.01721261 -0.5986918
2 -0.8104386  1.2335948 -0.6837159 0.4735597 -0.4686109  0.02647807  0.6389771
           B          B           B          B          C           C
1 -1.2980799  0.3834073 -0.04559749  0.8715914  1.1619585 -1.26236232
2 -0.3551722 -0.6587208  0.44822253 -0.1943887 -0.4958392  0.09581703
           C          C          C         C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2  0.1680119 -0.5990310  0.9779425 1.0819789

What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group. If the chosen sample size is larger than the number of columns in a group, all columns of that group should be taken.

I have tried an updated version of the method mentioned in this question:

sample rows of subgroups from dataframe with dplyr

but I'm not able to map the column names to the by argument.

Can someone help me with this?

Here's another approach, IIUC:

idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))

dframe[, keep]

Explanation:

The first step splits the column indices according to the column names:

idx
# $A
# [1]  1  2  3  4  5  6  7  8  9 10
# 
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 
# $C
# [1] 25 26 27 28 29 30

In the next step we use

pmin(7, lengths(idx))
#[1] 7 7 6

to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
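The two steps can be wrapped in a small helper. This is only a sketch and not part of the original answer: `sample_cols_by_group` and its argument `n` are hypothetical names. It also sidesteps a documented quirk of `sample()`, which treats a single number `x` as `1:x`, so a group containing just one column could otherwise be sampled from the wrong range.

```r
# Hypothetical helper (not from the original answer): pick up to `n`
# column indices per group, where groups are defined by the column names.
sample_cols_by_group <- function(df, n) {
  idx <- split(seq_along(df), names(df))
  keep <- lapply(idx, function(i) {
    # Sample positions within `i`, so a length-one group is never
    # expanded to 1:i by sample()
    i[sample.int(length(i), min(n, length(i)))]
  })
  # Sorting keeps the selected columns in their original order
  sort(unlist(keep, use.names = FALSE))
}

set.seed(1)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

keep <- sample_cols_by_group(dframe, 7)
table(names(dframe)[keep])  # columns kept per group: A = 7, B = 7, C = 6
sub_df <- dframe[, keep]
```

Sorting the indices is a design choice made here so that the subset preserves the column order of the input; drop the `sort()` call if the order does not matter.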

Not sure if you want a solution with dplyr, but here's one with just lapply:

dframe   <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters

# Number of columns to sample per group
nc <- 8


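res <- do.call(cbind,
       lapply(unique(colnames(dframe)),
              function(x) {
                # Column positions belonging to group x
                cols <- which(colnames(dframe) == x)
                # Keep the whole group if it has at most nc columns,
                # otherwise draw nc of them without replacement
                if (length(cols) > nc) cols <- sample(cols, nc, replace = FALSE)
                dframe[, cols]
              }
))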

It might look complicated, but it simply takes every column of a group if the group has nc or fewer columns, and samples nc columns at random if it has more than nc.
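A variant of the same idea, sketched here as an assumption rather than as the answer's own code, collects the column positions first and subsets the data frame once. Because the positions are kept, the group labels can be copied back from the original names directly. The name `picked` is introduced only for this illustration.

```r
set.seed(42)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

nc <- 8  # number of columns to sample per group

# For each group, look up its column positions once, then either keep
# them all or draw nc of them without replacement.
picked <- lapply(unique(colnames(dframe)), function(g) {
  cols <- which(colnames(dframe) == g)
  if (length(cols) <= nc) cols else sample(cols, nc)
})
picked <- unlist(picked)

res <- dframe[, picked]
# Copy the group labels from the original names for the chosen positions
colnames(res) <- colnames(dframe)[picked]
table(colnames(res))  # columns kept per group: A = 8, B = 8, C = 6
```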

And to restore your original column-name scheme, gsub does the trick (the dot is escaped and the pattern anchored so that only a trailing numeric suffix is removed):

colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))
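A quick check of the suffix-stripping pattern, as a sketch that assumes the duplicated names were made unique in the `make.unique()` style (`A`, `A.1`, `A.2`, ...). An unescaped `.` matches any character and a single `[[:digit:]]` consumes only one digit, so a suffix of 10 or more would leave a stray digit behind. Escaping the dot, allowing several digits, and anchoring at the end of the string avoids that.

```r
# Names in the style make.unique() produces for repeated "A" and "B" labels
nms <- make.unique(c(rep("A", 12), rep("B", 3)))

# Unescaped, single-digit pattern: a two-digit suffix is only partly removed
gsub(".[[:digit:]]", "", "A.10")  # "A0"

# Escaped dot, one or more digits, anchored at the end of the name
restored <- gsub("\\.[[:digit:]]+$", "", nms)
table(restored)  # A = 12, B = 3
```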
