How to prevent regmatches from dropping non-matches?
I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
I expected "a", NA, "a", "aa" instead.
Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"
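The same idea can be wrapped in a small helper. This is only a sketch: the name first_match is hypothetical (it is not a base R function), and it pre-fills the result with NA_character_ so the vector is character from the start.

```r
# first_match() is a hypothetical helper name, not part of base R.
first_match <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  # Positions where a match was found (NA inputs count as no match).
  hit <- !is.na(m) & m != -1L
  # Start with a character NA for every element.
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched elements, in order,
  # so they line up with the positions flagged in `hit`.
  out[hit] <- regmatches(x, m)
  out
}

x <- c("abc", "def", "cba a", "aa")
first_match("a+", x)
# [1] "a"  NA   "a"  "aa"
```

Extra arguments such as perl=TRUE are passed through to regexpr via `...`.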
Use regexec instead, since it returns a list, which lets you catch the character(0) results before unlisting:
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"
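A slightly more explicit variant of the regexec approach, as a sketch: it uses lengths() to find the empty entries and vapply() instead of unlist(), on the assumption that you want exactly one string per input even if the pattern later gains capture groups (regexec would then return more than one element per match).

```r
x <- c("abc", "def", "cba a", "aa")
R <- regmatches(x, regexec("a+", x))
# lengths() gives the number of elements in each list entry;
# entries of length 0 are the character(0) results for non-matches.
R[lengths(R) == 0] <- NA_character_
# Take the first element of each entry (the full match).
res <- vapply(R, `[`, character(1), 1)
res
# [1] "a"  NA   "a"  "aa"
```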
As of R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. The help file says:
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list. In most cases of interest (matching a single pattern), regmatches with this argument returns a list whose elements have length 3 or 1: length 1 when no match is found, and length 3 when there is a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two steps into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version, NA_character_, avoids this.
An even slicker extraction method for the final line is to use [ directly. Indexing a length-1 element at position 2 returns NA, so the non-matches become NA automatically:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)
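For completeness, here is the invert=NA approach as one self-contained sketch. It swaps sapply for vapply, which is an assumption about what you want: vapply fails loudly if any element does not yield exactly one string, whereas sapply would silently change the shape of the result. It needs R 3.3.0 or later for invert=NA.

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)
# Each element is either c(before, match, after) or just the unmatched string.
parts <- regmatches(x, m, invert = NA)
# Position 2 is the match; for length-1 elements it is out of range, giving NA.
first <- vapply(parts, `[`, character(1), 2)
first
# [1] "a"  NA   "a"  "aa"
```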
Using more or less the same construction as yours:
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit: it seems I misunderstood the question. But since two people have found this useful, I shall let it stay.
You can use stringr::str_extract(string, pattern). It returns NA if there is no match. It also has a simpler function interface than regmatches().
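Applied to the vector from the question, this might look like the sketch below. It assumes the stringr package is installed; the requireNamespace() guard is only there so the snippet does nothing rather than fail when it is not.

```r
x <- c("abc", "def", "cba a", "aa")
# stringr is a CRAN package, not part of base R.
if (requireNamespace("stringr", quietly = TRUE)) {
  # str_extract() returns the first match per element, or NA if none.
  res <- stringr::str_extract(x, "a+")
  print(res)
}
# [1] "a"  NA   "a"  "aa"
```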