What is an elegant way to filter a data table based on a calculation of the values of two columns? [R]

I have a data table, let's call it lung:

> lung
                     variant_id             transcript_id is_NL counts nrows
     1: chr10_129450960_T_C_b38 chr10_129467297_129536240     0  33029   458
     2: chr10_129450960_T_C_b38 chr10_129467297_129536240     1   3477    54
     3: chr10_129450960_T_C_b38 chr10_129467297_129536240     2    130     3
     4: chr10_129450960_T_C_b38 chr10_129536378_129563778     0     51   458
     5: chr10_129450960_T_C_b38 chr10_129536378_129563778     1      8    54
    ---
500148:   chr9_34699703_G_C_b38    chr9_34649082_34649409     1   4214    57
500149:   chr9_34699703_G_C_b38    chr9_34649082_34649409     2    171     2
500150:   chr9_34699703_G_C_b38    chr9_34649565_34650368     0  48713   456
500151:   chr9_34699703_G_C_b38    chr9_34649565_34650368     1   4932    57
500152:   chr9_34699703_G_C_b38    chr9_34649565_34650368     2    208     2

I would like to filter it such that when is_NL == 0, the only rows preserved are those for which counts/nrows < 50 (50 being an arbitrary number), and when is_NL is 1 or 2, the only rows preserved are those for which counts/nrows > 50.

So far, I've only been able to come up with this:

> lung[which(lung[is_NL == 0][,counts]/lung[is_NL == 0][,nrows] < 50),]
                     variant_id             transcript_id is_NL counts nrows
     1: chr10_129450960_T_C_b38 chr10_129467297_129536240     1   3477    54
     2: chr10_129450960_T_C_b38 chr10_129536378_129563778     0     51   458
     3: chr10_129450960_T_C_b38 chr10_129536378_129563778     1      8    54
     4: chr10_129450960_T_C_b38 chr10_129536378_129707894     0  37918   458
     5: chr10_129450960_T_C_b38 chr10_129701913_129707894     0    188   458
    ---
147877:  chr17_45825156_G_A_b38   chr17_46148240_46152903     2     17    20
147878:  chr17_45825156_G_A_b38   chr17_46152967_46156773     0      3   336
147879:  chr17_45825156_G_A_b38   chr17_46152967_46169530     0      5   336
147880:  chr17_45825156_G_A_b38   chr17_46152967_46169530     1    137   159
147881:  chr17_45825156_G_A_b38   chr17_46156896_46170854     0     18   336
> lung[which(lung[is_NL > 0]$counts/lung[is_NL > 0]$nrows > 50),]
                    variant_id             transcript_id is_NL counts nrows
    1: chr10_129450960_T_C_b38 chr10_129467297_129536240     0  33029   458
    2: chr10_129450960_T_C_b38 chr10_129536378_129563778     1      8    54
    3: chr10_129450960_T_C_b38 chr10_129701913_129707894     1     24    54
    4: chr10_129450960_T_C_b38 chr10_129701913_129707894     2      2     3
    5: chr10_129450960_T_C_b38 chr10_129708044_129715519     2      0     3
   ---
50195:  chr17_46025930_T_C_b38   chr17_46039885_46050532     0  14129   337
50196:  chr17_46025930_T_C_b38   chr17_46050705_46066536     0  14106   337
50197:  chr17_46025930_T_C_b38   chr17_46050705_46066536     1   6658   158
50198:  chr17_46025930_T_C_b38   chr17_46050705_46066536     2    809    20
50199:  chr17_46025930_T_C_b38   chr17_46066733_46067548     0  12842   337

which, as you can tell by looking at the is_NL column, does not work. I could probably subset into two different tables first, apply the comparison filter (< or > 50), and then figure out how to merge them, but I feel like there should be a simpler way to do this that I don't know about.
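For reference, the two-step route mentioned above (subset, filter each part, then recombine) could be sketched roughly as follows. This is only an illustration on the first five rows of the example, rebuilt here as a plain data.frame with just the numeric columns; the names ratio, low, high and combined are made up for the sketch.

```r
# First five rows of the example (numeric columns only), as a data.frame
lung <- data.frame(
  is_NL  = c(0L, 1L, 2L, 0L, 1L),
  counts = c(33029L, 3477L, 130L, 51L, 8L),
  nrows  = c(458L, 54L, 3L, 458L, 54L)
)

# Ratio computed once on the full table
ratio <- lung$counts / lung$nrows

# Filter each group with its own comparison
low  <- lung[lung$is_NL == 0 & ratio < 50, ]
high <- lung[lung$is_NL %in% c(1, 2) & ratio > 50, ]

# Stack the two pieces back together
combined <- rbind(low, high)
```

Note that rbind() puts all is_NL == 0 rows first, so the original row order is not preserved; a single combined condition avoids that.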

In base R, you could do something like:

lung[with(lung, (is_NL == 0 & counts/nrows < 50) | 
                (is_NL %in% c(1,2) & counts/nrows > 50)),]
# output
               variant_id             transcript_id is_NL counts nrows
2 chr10_129450960_T_C_b38 chr10_129467297_129536240     1   3477    54
4 chr10_129450960_T_C_b38 chr10_129536378_129563778     0     51   458

where I created lung as the first 5 lines in your example:

lung <- structure(list(variant_id = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10_129450960_T_C_b38", class = "factor"), 
    transcript_id = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("chr10_129467297_129536240", 
    "chr10_129536378_129563778"), class = "factor"), is_NL = c(0L, 
    1L, 2L, 0L, 1L), counts = c(33029L, 3477L, 130L, 51L, 8L), 
    nrows = c(458L, 54L, 3L, 458L, 54L)), class = "data.frame", row.names = c(NA, 
-5L))
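As for why the attempt in the question returns rows with the wrong is_NL values: which() is applied to a subset, so the positions it returns refer to rows of that subset, not rows of the full table. A minimal sketch of the mismatch, using the same five rows (numeric columns only; the variable names are illustrative):

```r
# Same five rows as above, numeric columns only
lung <- data.frame(
  is_NL  = c(0L, 1L, 2L, 0L, 1L),
  counts = c(33029L, 3477L, 130L, 51L, 8L),
  nrows  = c(458L, 54L, 3L, 458L, 54L)
)

# Positions computed inside the is_NL == 0 subset (only 2 rows here) ...
sub <- lung[lung$is_NL == 0, ]
idx_in_subset <- which(sub$counts / sub$nrows < 50)

# ... are then used against the full table, so they point at unrelated rows
wrong <- lung[idx_in_subset, ]

# A logical vector as long as the full table keeps everything aligned
ratio <- lung$counts / lung$nrows
keep  <- (lung$is_NL == 0 & ratio < 50) | (lung$is_NL %in% c(1, 2) & ratio > 50)
right <- lung[keep, ]
```

Here idx_in_subset is 2 (the second row of the subset), but lung[2, ] is a row with is_NL == 1, which is the kind of stray row seen in the question's output.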

Using data.table:

library(data.table)
# use the nrows column (not .N, which is the total row count of the table)
setDT(lung)[(is_NL == 0 & counts/nrows < 50) | (is_NL %in% c(1, 2) & counts/nrows > 50)]

I would create a flag using tidyverse:

library(dplyr)

lung %>%
  # FLG is 1 for rows that satisfy either condition, 0 otherwise
  mutate(FLG = if_else(is_NL == 0 & counts/nrows < 50, 1,
               if_else(is_NL %in% c(1, 2) & counts/nrows > 50, 1, 0))) %>%
  filter(FLG == 1)

If you want to use base R, then:

lung[which(with(lung, (is_NL == 0 & counts/nrows < 50) | (is_NL %in% c(1, 2) & counts/nrows > 50))), ]
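Since only the direction of the comparison changes with is_NL, the condition can also be written with ifelse(). The following is a sketch under the assumption that is_NL only takes the values 0, 1 and 2 and contains no NA; it again uses the first five example rows (numeric columns only), and ratio, keep and result are illustrative names.

```r
# First five rows of the example, numeric columns only
lung <- data.frame(
  is_NL  = c(0L, 1L, 2L, 0L, 1L),
  counts = c(33029L, 3477L, 130L, 51L, 8L),
  nrows  = c(458L, 54L, 3L, 458L, 54L)
)

ratio <- lung$counts / lung$nrows

# For is_NL == 0 test ratio < 50, for every other row test ratio > 50
# (assumes is_NL is never anything other than 0, 1 or 2)
keep <- ifelse(lung$is_NL == 0, ratio < 50, ratio > 50)

result <- lung[keep, ]
```

On these five rows this keeps rows 2 and 4, the same rows as the explicit two-condition filter.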
