How to subset a data.table on the basis of values in a certain column number

In data.table, one way to subset a table by a numeric vector of column numbers is to use with=FALSE.
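For reference, a minimal sketch of that with=FALSE column selection. The small table `dt` and the names `col_nums` and `picked` are made up for illustration and are not part of the original post.

```r
library(data.table)

# Small made-up table, for illustration only
dt <- data.table(id        = c("geneA", "geneB", "geneC"),
                 co1       = c(1, 2, 7),
                 co2       = c(0, 4, 5),
                 nontarget = c(9, 0, 7),
                 co3       = c(0, 1, 4))

# Positions of the columns whose names contain "co"
col_nums <- grep("co", names(dt))

# with = FALSE makes j be read as a vector of column positions (or names),
# so this selects columns; it does not filter rows
picked <- dt[, col_nums, with = FALSE]
print(names(picked))
```

Note that this only picks columns. Filtering rows by a condition on those columns is a separate step.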

I'm trying to loop over the columns of a data.table, identified by a numeric vector of column numbers, looking for rows that meet a certain criterion, as follows:

require(data.table)

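ab=data.table(id=c("geneA", "geneB", "geneC", "geneA", "geneA", "geneB", "", "NA"),
              ##rep_len() recycles each 5-value vector to the 8 rows of id explicitly;
              ##  newer data.table versions may refuse to do this implicitly
              co1=rep_len(c(1,2,3,0,7), 8), co2=rep_len(c(0,0,4,5,6), 8),
              nontarget=rep_len(c(9,0,7,6,5), 8), co3=rep_len(c(0,1,2,3,4), 8))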
target_col_nums=grep('co', colnames(ab))

##Data.table doesn't treat colnames(ab)[i] as one of the
##  column name variables, and with=F only seems to work for j in dt[i,j,by]
for (i in target_col_nums){
    print(ab[colnames(ab)[i]>3])
}

##This produces the desired output
ab[co1>3]
ab[co2>3]
ab[co3>3]

In my situation, the actual table is quite large, so I can't write out the column names themselves.

I hope that this is a useful question to the community.

# get() looks up the column named by the string, so the comparison is row-wise
for (col in grep('co', names(ab), value = TRUE))
  print(ab[get(col) > 3])
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
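To see why the loop in the question does not filter anything while get() does, here is a small sketch. The table `dt` and the variable names are invented for illustration.

```r
library(data.table)

# Small made-up table, for illustration only
dt <- data.table(id  = c("geneA", "geneB", "geneC"),
                 co1 = c(1, 2, 7),
                 co2 = c(0, 4, 5))
col_name <- "co1"

# Comparing the name itself: 3 is coerced to "3" and two strings are compared,
# giving a single TRUE/FALSE that says nothing about the rows
name_vs_number <- col_name > 3
print(length(name_vs_number))

# get() fetches the column that the string names, so the test runs per row
row_mask <- dt[, get(col_name) > 3]
print(row_mask)
print(dt[get(col_name) > 3])
```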

You can build the condition as a string and evaluate it (eval) as an expression:

for (i in target_col_nums){
    expr <- paste0(colnames(ab)[i], ">3")
    print(ab[eval(parse(text = expr)), ])
}

#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4

Or you can try any of the suggestions in the question "passing variables as data.table column names".
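If building code from strings feels fragile, one possible alternative (not from the original answer) is to pull each column out with `[[`, which accepts a column number directly. A sketch, with the question's table written out in full and `hits` as a hypothetical container:

```r
library(data.table)

# Same values as the question's table, written out in full
ab <- data.table(
  id        = c("geneA", "geneB", "geneC", "geneA", "geneA", "geneB", "", "NA"),
  co1       = c(1, 2, 3, 0, 7, 1, 2, 3),
  co2       = c(0, 0, 4, 5, 6, 0, 0, 4),
  nontarget = c(9, 0, 7, 6, 5, 9, 0, 7),
  co3       = c(0, 1, 2, 3, 4, 0, 1, 2)
)
target_col_nums <- grep("co", colnames(ab))

hits <- list()  # hypothetical container, one entry per target column
for (k in target_col_nums) {
  keep <- ab[[k]] > 3                   # logical vector, one value per row
  hits[[colnames(ab)[k]]] <- ab[keep]   # rows where column k exceeds 3
}
print(hits)
```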

Your approach can be adjusted very slightly to avoid column numbers altogether (using them is generally bad practice, though less harmful here since you got the numbers programmatically):

target_cols = names(ab)[grepl("co", names(ab))]

sapply(target_cols, function(jj) print(ab[get(jj) > 3]))

Wrap the call in invisible() if the extra NULL output returned by sapply is a distraction or otherwise bothers you.
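If you would rather keep the filtered tables than print them, a sketch along the same lines collects them in a named list and stacks them with rbindlist. The names `res`, `combined`, and `matched_col` are invented for illustration.

```r
library(data.table)

# Same values as the question's table, written out in full
ab <- data.table(
  id        = c("geneA", "geneB", "geneC", "geneA", "geneA", "geneB", "", "NA"),
  co1       = c(1, 2, 3, 0, 7, 1, 2, 3),
  co2       = c(0, 0, 4, 5, 6, 0, 0, 4),
  nontarget = c(9, 0, 7, 6, 5, 9, 0, 7),
  co3       = c(0, 1, 2, 3, 4, 0, 1, 2)
)
target_cols <- grep("co", names(ab), value = TRUE)

# Named list: one filtered table per target column
res <- lapply(setNames(target_cols, target_cols),
              function(jj) ab[get(jj) > 3])

# Stack the pieces; idcol records which column produced each row
combined <- rbindlist(res, idcol = "matched_col")
print(combined)
```

A row that passes the test for several columns appears once per matching column in `combined`.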

We can pass the column index i to .SDcols and apply the condition to .SD to get a logical vector, which can then be used to subset the rows.

for(i in target_col_nums){
 print(ab[ab[, .SD[[1L]] >3, .SDcols = i]])
}
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
#      id co1 co2 nontarget co3
#1: geneC   3   4         7   2
#2: geneA   0   5         6   3
#3: geneA   7   6         5   4
#4:    NA   3   4         7   2
#      id co1 co2 nontarget co3
#1: geneA   7   6         5   4
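As a sketch of what the inner call returns, and of one way the same idea might extend to several columns at once (keeping rows where any target column exceeds 3, which goes slightly beyond the original answer):

```r
library(data.table)

# Same values as the question's table, written out in full
ab <- data.table(
  id        = c("geneA", "geneB", "geneC", "geneA", "geneA", "geneB", "", "NA"),
  co1       = c(1, 2, 3, 0, 7, 1, 2, 3),
  co2       = c(0, 0, 4, 5, 6, 0, 0, 4),
  nontarget = c(9, 0, 7, 6, 5, 9, 0, 7),
  co3       = c(0, 1, 2, 3, 4, 0, 1, 2)
)
target_col_nums <- grep("co", colnames(ab))

# One column: the inner call returns a plain logical vector
mask_one <- ab[, .SD[[1L]] > 3, .SDcols = target_col_nums[1]]

# Several columns: TRUE where any target column exceeds 3
mask_any <- ab[, Reduce(`|`, lapply(.SD, function(x) x > 3)),
               .SDcols = target_col_nums]
print(ab[mask_any])
```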
