
R: subset dataframe based on column entry in multiple rows

I have a dataframe with information on several genes, in a format similar to:

chr    start    end    Gene    Region
1    100    110    Bat     Exon
1    120    130    Bat     Intron
1    500    550    Ball    Upstream, Downstream
1    590    600    Ball    Intron, Upstream
1    900    980    Mit     Promoter, Upstream

I would like to subset the data to remove every row belonging to a gene that has "Exon" or "Promoter" in the Region column of any of its rows. I had been using:

Regions <- subset(Table, Region == "Intron" | Region== "DownStream" | Region =="Upstream" | Region=="DownStream,Upstream")

However, this gives me:

chr    start    end    Gene    Region
1    120    130    Bat     Intron
1    500    550    Ball    Upstream, Downstream
1    590    600    Ball    Intron, Upstream

What I want is:

chr    start    end    Gene    Region
1    500    550    Ball    Upstream, Downstream
1    590    600    Ball    Intron, Upstream

Try this using grepl:

df[!grepl("Exon|Promoter", df$Region),]
#  chr start end Gene               Region
#2   1   120 130  Bat               Intron
#3   1   500 550 Ball Upstream, Downstream
#4   1   590 600 Ball     Intron, Upstream

It's not clear to me why you want row 2, with "Intron", removed as well. Please explain that.

Edit:

I think I understand now. You want to drop every row of a gene if any of its rows matches, so try this instead:

temp <- df$Gene[grepl("Exon|Promoter", df$Region)]
df[!df$Gene %in% temp,]
#  chr start end Gene               Region
#3   1   500 550 Ball Upstream, Downstream
#4   1   590 600 Ball     Intron, Upstream
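
For anyone who wants to run this end to end, here is a minimal reproducible sketch. It assumes the table from the question is held in a data frame called df with Region stored as character; the names hit, bad_genes and result are only illustrative.

```r
# Rebuild the example table from the question.
# Assumption: Region is a character column, not a factor.
df <- data.frame(
  chr    = c(1, 1, 1, 1, 1),
  start  = c(100, 120, 500, 590, 900),
  end    = c(110, 130, 550, 600, 980),
  Gene   = c("Bat", "Bat", "Ball", "Ball", "Mit"),
  Region = c("Exon", "Intron", "Upstream, Downstream",
             "Intron, Upstream", "Promoter, Upstream"),
  stringsAsFactors = FALSE
)

# Row-level flag: TRUE where this row's Region mentions Exon or Promoter.
hit <- grepl("Exon|Promoter", df$Region)

# Genes that have at least one flagged row.
bad_genes <- unique(df$Gene[hit])

# Drop every row of those genes, not only the flagged rows.
result <- df[!(df$Gene %in% bad_genes), ]
print(result)
```

Note that grepl matches the pattern as a case-sensitive substring by default, so a value such as "exon" in lower case would not be caught unless ignore.case = TRUE is passed.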


 