Pattern matching in R using grepl

I have a data frame dat like this:

  P pedigree            cas
1 M rs2745406           T
2 M rs6939431           A
3 M SNP_DPB1_33156641   G
4 M SNP_DPB1_33156664_G P
5 M SNP_DPB1_33156664_A A
6 M SNP_DPB1_33156664_T A

I want to exclude all rows where the pedigree column starts with SNP_ and ends with an underscore followed by G, C, T, or A (_[GCTA]). In this case, that would be rows 4, 5, and 6.

How can I achieve this in R? I have tried:

multisnp <- which(grepl("^SNP_*_[GCTA]$", dat$pedigree)=="TRUE")

new_dat <- dat[-multisnp,]

My multisnp vector is empty, but I can't figure out how to fix the pattern so that it matches what I want. I think my use of the wildcard * is wrong.

In a regular expression, * is a quantifier rather than a wildcard, so _* in your pattern means "zero or more underscores". You can use .*? instead (match any characters in a non-greedy way):

multisnp <- which(grepl("^SNP_.*?_[GCTA]$", dat$pedigree))
                              ^^^
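
To see the difference between the two patterns, both can be run against the sample values from the question. This is only a minimal sketch: the vector name pedigree stands in for dat$pedigree, and the names original and fixed are made up for illustration.

```r
# Sample values copied from the pedigree column shown in the question
pedigree <- c("rs2745406", "rs6939431", "SNP_DPB1_33156641",
              "SNP_DPB1_33156664_G", "SNP_DPB1_33156664_A",
              "SNP_DPB1_33156664_T")

# Original attempt: "_*" means "zero or more underscores", not "anything"
original <- grepl("^SNP_*_[GCTA]$", pedigree)

# Suggested fix: ".*?" matches any run of characters (lazily)
fixed <- grepl("^SNP_.*?_[GCTA]$", pedigree)

which(original)  # empty: no element matches
which(fixed)     # 4 5 6
```

Because the pattern is anchored at both ends with ^ and $, the greedy form .* selects the same rows here as the lazy form .*?.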

You can subset dat like this:

new_dat <- dat[!grepl("^SNP_.*_[GCTA]$", dat$pedigree), ]
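
As a rough illustration of why the logical form may be preferable to dat[-which(...), ], the sketch below rebuilds the data frame from the values printed in the question (the reconstruction is an assumption about its exact structure, and the names is_multisnp and no_match are made up). When which() finds nothing it returns integer(0), and negative indexing with an empty vector drops every row instead of keeping them all.

```r
# Data frame reconstructed from the question's printed output (assumed layout)
dat <- data.frame(
  P = rep("M", 6),
  pedigree = c("rs2745406", "rs6939431", "SNP_DPB1_33156641",
               "SNP_DPB1_33156664_G", "SNP_DPB1_33156664_A",
               "SNP_DPB1_33156664_T"),
  cas = c("T", "A", "G", "P", "A", "A"),
  stringsAsFactors = FALSE
)

# Logical subsetting: keep rows that do NOT match the pattern
is_multisnp <- grepl("^SNP_.*_[GCTA]$", dat$pedigree)
new_dat <- dat[!is_multisnp, ]
new_dat$pedigree  # rows 1-3 remain

# Pitfall with -which(): the question's pattern matches nothing,
# so which() returns integer(0) and dat[-integer(0), ] has zero rows
no_match <- which(grepl("^SNP_*_[GCTA]$", dat$pedigree))
nrow(dat[-no_match, ])                                # 0
nrow(dat[!grepl("^SNP_*_[GCTA]$", dat$pedigree), ])   # 6
```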

Regarding the code that you've tried, I'm not sure that grepl("^SNP_*_[GCTA]$") will complete without an error, since you aren't passing an x vector to grepl. See ?grepl for more info.
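
One way to check this is to wrap the call in tryCatch. The sketch below is illustrative only: the variable names result and with_x are hypothetical, and the error handler just returns a marker string rather than the actual message.

```r
# Calling grepl() with only a pattern: the x argument has no default
result <- tryCatch(
  grepl("^SNP_*_[GCTA]$"),
  error = function(e) "error"
)
result  # "error"

# Supplying x runs fine; this pattern simply does not match the value
with_x <- grepl("^SNP_*_[GCTA]$", "SNP_DPB1_33156664_G")
with_x  # FALSE
```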
