What is an elegant way to filter a data table based on a calculation of the values of two columns? [R]
I have a data table, let's call it lung:
> lung
variant_id transcript_id is_NL counts nrows
1: chr10_129450960_T_C_b38 chr10_129467297_129536240 0 33029 458
2: chr10_129450960_T_C_b38 chr10_129467297_129536240 1 3477 54
3: chr10_129450960_T_C_b38 chr10_129467297_129536240 2 130 3
4: chr10_129450960_T_C_b38 chr10_129536378_129563778 0 51 458
5: chr10_129450960_T_C_b38 chr10_129536378_129563778 1 8 54
---
500148: chr9_34699703_G_C_b38 chr9_34649082_34649409 1 4214 57
500149: chr9_34699703_G_C_b38 chr9_34649082_34649409 2 171 2
500150: chr9_34699703_G_C_b38 chr9_34649565_34650368 0 48713 456
500151: chr9_34699703_G_C_b38 chr9_34649565_34650368 1 4932 57
500152: chr9_34699703_G_C_b38 chr9_34649565_34650368 2 208 2
I would like to filter it so that when is_NL == 0, the only rows kept are those where counts/nrows < 50 (50 being an arbitrary number), and when is_NL is 1 or 2, the only rows kept are those where counts/nrows > 50.
So far, I've only been able to come up with this:
> lung[which(lung[is_NL == 0][,counts]/lung[is_NL == 0][,nrows] < 50),]
variant_id transcript_id is_NL counts nrows
1: chr10_129450960_T_C_b38 chr10_129467297_129536240 1 3477 54
2: chr10_129450960_T_C_b38 chr10_129536378_129563778 0 51 458
3: chr10_129450960_T_C_b38 chr10_129536378_129563778 1 8 54
4: chr10_129450960_T_C_b38 chr10_129536378_129707894 0 37918 458
5: chr10_129450960_T_C_b38 chr10_129701913_129707894 0 188 458
---
147877: chr17_45825156_G_A_b38 chr17_46148240_46152903 2 17 20
147878: chr17_45825156_G_A_b38 chr17_46152967_46156773 0 3 336
147879: chr17_45825156_G_A_b38 chr17_46152967_46169530 0 5 336
147880: chr17_45825156_G_A_b38 chr17_46152967_46169530 1 137 159
147881: chr17_45825156_G_A_b38 chr17_46156896_46170854 0 18 336
> lung[which(lung[is_NL > 0]$counts/lung[is_NL > 0]$nrows > 50),]
variant_id transcript_id is_NL counts nrows
1: chr10_129450960_T_C_b38 chr10_129467297_129536240 0 33029 458
2: chr10_129450960_T_C_b38 chr10_129536378_129563778 1 8 54
3: chr10_129450960_T_C_b38 chr10_129701913_129707894 1 24 54
4: chr10_129450960_T_C_b38 chr10_129701913_129707894 2 2 3
5: chr10_129450960_T_C_b38 chr10_129708044_129715519 2 0 3
---
50195: chr17_46025930_T_C_b38 chr17_46039885_46050532 0 14129 337
50196: chr17_46025930_T_C_b38 chr17_46050705_46066536 0 14106 337
50197: chr17_46025930_T_C_b38 chr17_46050705_46066536 1 6658 158
50198: chr17_46025930_T_C_b38 chr17_46050705_46066536 2 809 20
50199: chr17_46025930_T_C_b38 chr17_46066733_46067548 0 12842 337
which, as you can tell by looking at the is_NL column, does not work. I could probably subset into two different tables first, apply the comparison filter (< or > 50), and then figure out how to merge them, but I feel like there should be a simpler way to do this that I don't know about.
In base R, you could do something like:
lung[with(lung, (is_NL == 0 & counts/nrows < 50) |
(is_NL %in% c(1,2) & counts/nrows > 50)),]
# output
variant_id transcript_id is_NL counts nrows
2 chr10_129450960_T_C_b38 chr10_129467297_129536240 1 3477 54
4 chr10_129450960_T_C_b38 chr10_129536378_129563778 0 51 458
where I created lung as the first 5 lines in your example:
lung <- structure(list(variant_id = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "chr10_129450960_T_C_b38", class = "factor"),
transcript_id = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("chr10_129467297_129536240",
"chr10_129536378_129563778"), class = "factor"), is_NL = c(0L,
1L, 2L, 0L, 1L), counts = c(33029L, 3477L, 130L, 51L, 8L),
nrows = c(458L, 54L, 3L, 458L, 54L)), class = "data.frame", row.names = c(NA,
-5L))
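As a sketch of why the attempt in the question returns rows with the wrong is_NL values: which() there is computed on a subset, so the positions it returns refer to the subset and not to the full table. The snippet below rebuilds only the three numeric columns of the five example rows as a plain data.frame (the names sub, idx, wrong, keep and right are made up for illustration) and contrasts that with a logical vector as long as the table.

```r
# First five rows of the example, numeric columns only (plain data.frame)
lung <- data.frame(
  is_NL  = c(0L, 1L, 2L, 0L, 1L),
  counts = c(33029L, 3477L, 130L, 51L, 8L),
  nrows  = c(458L, 54L, 3L, 458L, 54L)
)

# The question's idea: compute positions inside the is_NL == 0 subset ...
sub <- lung[lung$is_NL == 0, ]
idx <- which(sub$counts / sub$nrows < 50)   # position within `sub`, not `lung`

# ... and apply them to the full table, which picks an unrelated row
wrong <- lung[idx, ]

# A logical vector with one entry per row of `lung` stays aligned
keep <- with(lung, (is_NL == 0 & counts / nrows < 50) |
                   (is_NL %in% c(1, 2) & counts / nrows > 50))
right <- lung[keep, ]
```

Here idx points at the second row of sub, but lung[idx, ] returns the second row of lung, which has is_NL == 1. The logical vector keep selects rows 2 and 4, matching the output shown above.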
Using data.table:
library(data.table)
# nrows is a column of lung; .N would instead be the table's total row count
setDT(lung)[(is_NL == 0 & counts/nrows < 50) | (is_NL %in% c(1, 2) & counts/nrows > 50)]
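A minimal check of the data.table form on the five example rows, assuming the data.table package is installed and dividing by the nrows column as the question describes. The variable cutoff is a made-up name for the arbitrary threshold of 50, and only the numeric columns are rebuilt here.

```r
library(data.table)

# Five example rows, numeric columns only
lung <- data.table(
  is_NL  = c(0L, 1L, 2L, 0L, 1L),
  counts = c(33029L, 3477L, 130L, 51L, 8L),
  nrows  = c(458L, 54L, 3L, 458L, 54L)
)

cutoff <- 50  # arbitrary threshold from the question (hypothetical name)

# Column names are looked up inside the table; `cutoff` is not a column,
# so it is taken from the calling environment
res <- lung[(is_NL == 0 & counts / nrows < cutoff) |
            (is_NL %in% c(1, 2) & counts / nrows > cutoff)]
```

Keeping the threshold in one variable means a change to the cutoff only has to be made in one place.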
I would create a flag using tidyverse:
library(dplyr)
lung %>%
  mutate(FLG = if_else(is_NL == 0 & counts/nrows < 50, 1,
               if_else(is_NL %in% c(1, 2) & counts/nrows > 50, 1, 0))) %>%
  filter(FLG == 1)
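If the flag column is not needed afterwards, the same condition can probably be passed straight to filter(). This is a sketch of that variant, not the approach of the answer above, assuming dplyr is installed and using only the numeric columns of the five example rows.

```r
library(dplyr)

# Five example rows, numeric columns only
lung <- data.frame(
  is_NL  = c(0L, 1L, 2L, 0L, 1L),
  counts = c(33029L, 3477L, 130L, 51L, 8L),
  nrows  = c(458L, 54L, 3L, 458L, 54L)
)

# Keep a row when either branch of the rule holds; no helper column is created
res <- lung %>%
  filter((is_NL == 0 & counts / nrows < 50) |
         (is_NL %in% c(1, 2) & counts / nrows > 50))
```

The flag version can still be handy when the intermediate 0/1 column is worth inspecting before the rows are dropped.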
If you want to use base R, then:
lung[which(with(lung, (is_NL == 0 & counts/nrows < 50) | (is_NL %in% c(1, 2) & counts/nrows > 50))), ]