R regex: issues with character vectors containing NAs

I was trying to collapse every run of two or more whitespace characters within the elements of a vector into a single space, using gsub(), e.g.:

x1 <- c("  abc", "a b c    ", "a  b c")
gsub("\\s{2,}", " ", x1)
[1] " abc"   "a b c " "a b c"

But as soon as the vector contains an NA, the substitution fails and every non-NA element is replaced by a single space:

x2 <- c(NA, "  abc", "a b c    ", "a  b c")
gsub("\\s{2,}", " ", x2)
[1] NA  " " " " " "

However, it works fine if one uses Perl-like regular expressions:

gsub("\\s{2,}", " ", x2, perl = TRUE)
[1] NA       " abc"   "a b c " "a b c"

Does anyone have suggestions as to why R's own regular expressions behave in that way? I'm using R 3.1.1 on Linux x86-64 if that helps.
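To see whether a given installation is affected, one option is to run both engines on the same input and compare the results. This is only a sketch: the variable names (pattern, default_out, perl_out, engines_agree) are illustrative, and it treats the perl = TRUE result as the reference because that is the variant reported as working above.

```r
# Input from the question: one NA plus strings with runs of spaces.
x2 <- c(NA, "  abc", "a b c    ", "a  b c")
pattern <- "\\s{2,}"

# Default regex engine versus the PCRE engine (perl = TRUE).
default_out <- gsub(pattern, " ", x2)
perl_out    <- gsub(pattern, " ", x2, perl = TRUE)

# TRUE means both engines produced the same vector on this installation.
engines_agree <- identical(default_out, perl_out)

print(R.version.string)
print(engines_agree)
```

If engines_agree is FALSE, printing default_out next to perl_out shows which elements differ.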

I haven't poked at the source code, but it also works if you use the useBytes = TRUE argument (without perl = TRUE). From the help: "if useBytes is TRUE the matching is done byte-by-byte rather than character-by-character." That may be part of why it's failing in gsub.
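A minimal sketch of the useBytes variant, using the vector from the question (bytes_out is just an illustrative name). Because the matching is done byte by byte, this is probably only a safe workaround for ASCII input; with multibyte text it could behave differently from character-wise matching.

```r
# Same input as in the question.
x2 <- c(NA, "  abc", "a b c    ", "a  b c")

# Default engine, but matching bytes instead of characters.
bytes_out <- gsub("\\s{2,}", " ", x2, useBytes = TRUE)

print(bytes_out)
```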

However, regexpr, regexec and gregexpr each find all the correct positions (I have replaced \\s with [[:space:]] for readability, and only show the output from regexpr):

regexpr("[[:space:]]{2,}", x2)

## [1] NA  1  1  1
## attr(,"match.length")
## [1] NA  5  9  6

So, the regex itself is fine.
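When checking match positions it can help to print the matched text next to the start and length, so the result can be compared against a count by hand. Counting by hand, the first run of two or more spaces in the three strings starts at character 1, 6 and 2, with length 2, 4 and 2. The sketch below uses perl = TRUE as the reference engine, which is an assumption based on the question reporting it as working; the column names are illustrative.

```r
# Same input as in the question.
x2 <- c(NA, "  abc", "a b c    ", "a  b c")

# First match of two or more whitespace characters per element.
m <- regexpr("[[:space:]]{2,}", x2, perl = TRUE)

start <- as.integer(m)            # start position of the match (NA for NA input)
len   <- attr(m, "match.length")  # length of the match (NA for NA input)

# Extract the matched text; NA positions give NA.
matched <- substring(x2, start, start + len - 1L)

print(data.frame(input = x2, start = start, length = len, matched = matched))
```

A match length equal to nchar() of the whole element would mean the pattern consumed the entire string rather than just the run of spaces.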

Update: a quick glance at do_gsub in R 3.1.1's grep.c didn't yield much insight (it's a twisted maze of if/else statements :-), but I'd almost want to call this a bug.

Just to wrap this question up: as several others suggested, the behaviour is in fact a bug. Reported and confirmed here:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16009
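On an affected version, a workaround that avoids both perl = TRUE and useBytes = TRUE might be to keep the NA elements away from gsub altogether. The helper below is hypothetical (collapse_ws is not part of base R) and assumes the problem is triggered only by NA elements, as the examples above suggest.

```r
# Hypothetical helper: collapse runs of 2+ whitespace characters,
# applying gsub() only to the non-NA elements.
collapse_ws <- function(x) {
  out <- x
  ok <- !is.na(x)                          # positions safe to pass to gsub()
  out[ok] <- gsub("\\s{2,}", " ", x[ok])   # NA elements are left untouched
  out
}

x2 <- c(NA, "  abc", "a b c    ", "a  b c")
print(collapse_ws(x2))
```

The NA elements stay in their original positions, so the result has the same length and order as the input.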
