Regular expression matching with gregexpr in R

Question

I'm trying to count the instances of 3 consecutive "a" events, "aaa".

The string will consist of lowercase letters, e.g. "abaaaababaaa".

I tried the following piece of code, but the behavior is not precisely what I am looking for.

x <- "abaaaababaaa"
gregexpr("aaa", x)

I would like the match to return 3 instances of the "aaa" occurrence as opposed to 2.

Assume indexing begins at 1.

The first occurrence of "aaa" is at index 3.
The second occurrence of "aaa" is at index 4 (this one is not caught by gregexpr, because it overlaps the first).
The third occurrence of "aaa" is at index 10.

Answer 1

To catch the overlapping matches, you can use a lookahead like this:

gregexpr("a(?=aa)", x, perl=TRUE)

However, your matches are now just a single "a", so it might complicate further processing of these matches, especially if you're not always looking for fixed-length patterns.
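For the fixed-length case in the question, the start positions returned by the lookahead are enough to count the matches and rebuild the full substrings. The following is only a minimal sketch under that assumption (pattern length 3); the variable names m and starts are illustrative and not part of the original answer.

```r
x <- "abaaaababaaa"

# Lookahead match: consumes one "a" but requires "aa" to follow
m <- gregexpr("a(?=aa)", x, perl = TRUE)[[1]]

# Keep only real start positions (gregexpr returns -1 when nothing matches)
starts <- as.integer(m)
starts <- starts[starts > 0]

length(starts)                     # number of overlapping "aaa" matches
substring(x, starts, starts + 2)   # rebuild each 3-character match
```

Here starts holds the indices 3, 4 and 10 from the question. For patterns of varying length this reconstruction no longer works, which is the complication mentioned above.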

Answer 2

I know I'm late, but I wanted to share this solution:

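your.string <- "abaaaababaaa"
# number of 3-character windows (nchar - 2, so x[i+2] never runs past the end)
nc1 <- nchar(your.string) - 2
# split the string into single characters
x <- unlist(strsplit(your.string, NULL))
# build every overlapping 3-character window
x2 <- character(nc1)
for (i in seq_len(nc1))
  x2[i] <- paste(x[i], x[i+1], x[i+2], sep="")
cat("occurrences of <aaa> in <your.string> is,",
    length(grep("aaa", x2)), "and they are at index", grep("aaa", x2), "\n")
# occurrences of <aaa> in <your.string> is, 3 and they are at index 3 4 10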

Heavily inspired by this answer from R-help by Fran.
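The sliding-window idea in this answer can also be written without an explicit loop, since substring is vectorized over its start and stop arguments. This is a sketch rather than the answer author's code; the names starts, windows and hits are made up for illustration, and the window width of 3 is assumed from the question.

```r
your.string <- "abaaaababaaa"

# Start index of every 3-character window (none if the string is too short)
starts <- seq_len(max(0, nchar(your.string) - 2))

# Extract all overlapping windows in one vectorized call
windows <- substring(your.string, starts, starts + 2)

# Indices of the windows that are exactly "aaa"
hits <- which(windows == "aaa")
length(hits)
hits
```

Comparing with == instead of grep treats "aaa" as a literal string, so this variant only suits fixed strings, not general regular expressions.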

Answer 3

Here is a way to extract all overlapping matches of varying length using gregexpr.

x<-"abaaaababaaa"
# nest in lookahead + capture group
# to get all instances of the pattern "(ab)|b"
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# regmatches will reference the match.length attr. to extract the strings
# so move match length data from 'capture.length' to 'match.length' attr
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
# extract substrings
regmatches(x, matches)
# [[1]]
# [1] "ab" "b"  "ab" "b"  "ab" "b"

The trick is to wrap the pattern in a capture group and wrap that capture group in a lookahead assertion. gregexpr will return a list containing the start positions with an attribute capture.length, a matrix whose first column holds the match lengths of the first capture group. If you convert that column into a vector and move it into the match.length attribute (which is all zeros, since the entire pattern was inside a lookahead assertion), you can pass the result to regmatches to extract the strings.
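To see why the attribute swap is needed, it can help to inspect the match object before modifying it. The sketch below reuses the pattern from this answer; the variable names m, starts, zero.lengths and group1.lengths are illustrative only.

```r
x <- "abaaaababaaa"
m <- gregexpr("(?=((ab)|b))", x, perl = TRUE)[[1]]

# Start position of every (zero-width) lookahead match
starts <- as.integer(m)

# Lengths of the overall match: zero, because a lookahead consumes nothing
zero.lengths <- attr(m, "match.length")

# Lengths of the first capture group: the real substring lengths
group1.lengths <- as.vector(attr(m, "capture.length")[, 1])

starts
zero.lengths
group1.lengths
```

Because regmatches reads match.length, calling it on the unmodified object would yield empty strings; copying group1.lengths into match.length is what makes the extraction return the captured text.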

As hinted by the type of the final result, with a few modifications this can be vectorized for the case when x is a list of strings.

x<-list(s1="abaaaababaaa", s2="ab")
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# make a function that replaces match.length attr with capture.length
set.match.length<-
function(x) structure(x, match.length=as.vector(attr(x, 'capture.length')[,1]))
# set match.length to capture.length for each match object
matches<-lapply(matches, set.match.length)
# extract substrings
mapply(regmatches, x, lapply(matches, list))
# $s1
# [1] "ab" "b"  "ab" "b"  "ab" "b" 
# 
# $s2
# [1] "ab" "b"

3 answers

Answer 1
Score: 6 (accepted), 2013-01-22 04:26:07

Answer 2
Score: 1, 2013-01-22 04:55:28

Answer 3
Score: 0, 2013-01-22 06:11:27
