
Regex match on R gregexpr

I'm trying to count the instances of 3 consecutive "a" characters, "aaa".

The string consists of lowercase letters, e.g. "abaaaababaaa".

I tried the following piece of code, but the behavior is not precisely what I am looking for.

x <- "abaaaababaaa"
gregexpr("aaa", x)

I would like the match to return 3 instances of "aaa" rather than 2, i.e. overlapping occurrences should be counted too.

Assume indexing starts at 1:

  • The first occurrence of "aaa" is at index 3.
  • The second occurrence of "aaa" is at index 4 (this one is not caught by gregexpr).
  • The third occurrence of "aaa" is at index 10.
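
For reference, the non-overlapping behavior described above can be seen by looking at the start positions directly. This is only a small sketch; the names `m` and `starts` are illustrative and not part of the original code.

```r
x <- "abaaaababaaa"
m <- gregexpr("aaa", x)

# gregexpr returns a list with one element per input string;
# that element holds the start position of each match
starts <- as.integer(m[[1]])
print(starts)

# After the match at index 3 (characters 3 to 5) the scan resumes at
# index 6, so the overlapping occurrence starting at index 4 is skipped.
```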

To catch the overlapping matches, you can use a lookahead like this:

gregexpr("a(?=aa)", x, perl=TRUE)

However, each match is now just a single "a", which might complicate further processing of the matches, especially if you're not always looking for fixed-length patterns.
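
As a rough illustration of the lookahead approach, the sketch below pulls out the start positions and rebuilds the full substrings. It assumes the pattern length (3) is known in advance and that at least one match exists (gregexpr reports -1 otherwise); `starts` and `found` are illustrative names.

```r
x <- "abaaaababaaa"

# (?=aa) checks that "aa" follows without consuming it,
# so the next search can begin one character later
m <- gregexpr("a(?=aa)", x, perl = TRUE)
starts <- as.integer(m[[1]])

# each match is only the leading "a"; since the full pattern is
# 3 characters long, the complete substrings can be rebuilt by position
found <- substring(x, starts, starts + 2)

print(starts)
print(found)
```

The count asked for in the question is then simply `length(starts)`.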

I know I'm late, but I wanted to share this solution:

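your.string <- "abaaaababaaa"
# number of 3-character windows (stopping here avoids reading past the end)
n <- nchar(your.string) - 2
# split the string into single characters
x <- unlist(strsplit(your.string, NULL))
x2 <- character(n)
# build every window of 3 consecutive characters
for (i in seq_len(n))
  x2[i] <- paste(x[i], x[i+1], x[i+2], sep="")
hits <- which(x2 == "aaa")
cat("occurrences of <aaa> in <your.string> is,",
    length(hits), "and they are at index", hits, "\n")
# occurrences of <aaa> in <your.string> is, 3 and they are at index 3 4 10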

Heavily inspired by this answer from R-help by Fran.
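
The same sliding-window idea can probably be written without an explicit loop, since substring accepts vectors of start and end positions. This is a sketch rather than part of the original answer; `target`, `k`, `windows` and `hits` are illustrative names, and it assumes the string is at least as long as the target.

```r
your.string <- "abaaaababaaa"
target <- "aaa"
k <- nchar(target)

# start index of every window of length k
starts <- seq_len(nchar(your.string) - k + 1)

# substring is vectorized over the start/end positions,
# so this extracts all windows in one call
windows <- substring(your.string, starts, starts + k - 1)

# positions where the window equals the target
hits <- which(windows == target)
print(hits)
```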

Here is a way to extract all overlapping matches of varying length using gregexpr.

x<-"abaaaababaaa"
# nest in lookahead + capture group
# to get all instances of the pattern "(ab)|b"
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# regmatches will reference the match.length attr. to extract the strings
# so move match length data from 'capture.length' to 'match.length' attr
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
# extract substrings
regmatches(x, matches)
# [[1]]
# [1] "ab" "b"  "ab" "b"  "ab" "b" 

The trick is to wrap the pattern in a capture group and then wrap that capture group in a lookahead assertion. gregexpr will return a list containing the start positions, with an attribute capture.length: a matrix whose first column holds the match lengths of the first capture group. If you convert this column to a vector and move it into the match.length attribute (which is all zeros, since the entire pattern is inside a lookahead assertion), you can pass the result to regmatches to extract the strings.
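
To make the attribute shuffle a little more concrete, the sketch below inspects the pieces described above and rebuilds the substrings by hand instead of going through regmatches. The names `zero_len`, `cap_len` and `found` are illustrative only.

```r
x <- "abaaaababaaa"
m <- gregexpr("(?=((ab)|b))", x, perl = TRUE)[[1]]

# start position of each (zero-width) match
starts <- as.integer(m)

# the lookahead consumes nothing, so every overall match has length 0
zero_len <- attr(m, "match.length")

# first column of capture.length = lengths matched by the outer capture group
cap_len <- as.vector(attr(m, "capture.length")[, 1])

# rebuild the substrings from start positions and captured lengths
found <- substring(x, starts, starts + cap_len - 1)
print(found)
```

Overwriting match.length and calling regmatches, as in the answer, has the advantage that it keeps working with R's standard match objects.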

As hinted by the type of the final result, with a few modifications this can be vectorized for the case where x is a list of strings.

x<-list(s1="abaaaababaaa", s2="ab")
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# make a function that replaces match.length attr with capture.length
set.match.length<-
function(x) structure(x, match.length=as.vector(attr(x, 'capture.length')[,1]))
# set match.length to capture.length for each match object
matches<-lapply(matches, set.match.length)
# extract substrings
mapply(regmatches, x, lapply(matches, list))
# $s1
# [1] "ab" "b"  "ab" "b"  "ab" "b" 
# 
# $s2
# [1] "ab" "b" 
