R regex: issues with character vectors containing NAs

Question

I was trying to collapse every run of two or more whitespace characters within the elements of a vector into a single space using gsub(), e.g.:

x1 <- c("  abc", "a b c    ", "a  b c")
gsub("\\s{2,}", " ", x1)
[1] " abc"   "a b c " "a b c"

But as soon as the vector contains NA, the substitution fails:

x2 <- c(NA, "  abc", "a b c    ", "a  b c")
gsub("\\s{2,}", " ", x2)
[1] NA  " " " " " "

However, it works fine if one uses Perl-like regular expressions:

gsub("\\s{2,}", " ", x2, perl = TRUE)
[1] NA       " abc"   "a b c " "a b c"

Does anyone have suggestions as to why R's own regular expressions behave in that way? I'm using R 3.1.1 on Linux x86-64 if that helps.

Answer 1

I haven't poked at the source code, but it also works if you use the useBytes = TRUE parameter (without perl = TRUE). From the help: "if useBytes is TRUE the matching is done byte-by-byte rather than character-by-character." That may be part of why it's failing in gsub.
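To compare the two workarounds side by side, here is a minimal sketch that runs the same pattern once with useBytes = TRUE and once with perl = TRUE. The variable names res_bytes and res_perl are only illustrative, and the results on an affected R version are as reported in this thread rather than something the snippet itself guarantees.

```r
# Same input as in the question, including the leading NA
x2 <- c(NA, "  abc", "a b c    ", "a  b c")

# Byte-by-byte matching with the default regex engine
res_bytes <- gsub("\\s{2,}", " ", x2, useBytes = TRUE)

# Perl-compatible matching, the workaround from the question
res_perl <- gsub("\\s{2,}", " ", x2, perl = TRUE)

print(res_bytes)
print(res_perl)

# TRUE if both workarounds agree on this input
print(identical(res_bytes, res_perl))
```

If the two results differ on your installation, that in itself is a useful data point for a bug report.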

However, regexpr, regexec and gregexpr each find all the correct positions (I have substituted \\s with [[:space:]] for readability and only show the output from regexpr):

regexpr("[[:space:]]{2,}", x2)

## [1] NA  1  1  1
## attr(,"match.length")
## [1] NA  5  9  6

So, the regex itself is fine.
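One way to double-check output like the above is to compare each match.length with nchar() of the corresponding string. In the output shown, the lengths 5, 9 and 6 equal the full string lengths, so it may be worth verifying on your own R version whether the match covers only the whitespace run. A minimal sketch, where spans_whole is just an illustrative name:

```r
x2 <- c(NA, "  abc", "a b c    ", "a  b c")

m   <- regexpr("[[:space:]]{2,}", x2)
len <- attr(m, "match.length")

# TRUE where the reported match starts at 1 and is as long as the string,
# i.e. the pattern appears to have consumed the whole element
spans_whole <- !is.na(m) & m == 1L & len == nchar(x2)

print(as.integer(m))
print(len)
print(spans_whole)
```

Any TRUE in spans_whole for a string that also contains non-whitespace characters would suggest that the match itself, and not only the substitution, is affected.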

Update: a quick glance at do_gsub in R 3.1.1's grep.c didn't yield much insight (it's a twisted maze of if/else statements :-), but I'd almost want to call this a bug.

Answer 2

Just to wrap this question up: as several others suggested, the behaviour is in fact a bug. Reported and confirmed here:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16009
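Until you can move to an R version that includes the fix, one possible workaround is to keep NA away from gsub altogether. The helper below is a sketch, not an official recommendation: collapse_ws is a hypothetical name, and it combines NA filtering with perl = TRUE as a belt-and-braces measure, based only on the behaviour reported in this thread.

```r
# Hypothetical helper: collapse runs of 2+ whitespace characters,
# leaving NA elements untouched
collapse_ws <- function(x) {
  out <- as.character(x)
  ok  <- !is.na(out)  # only non-NA elements are passed to gsub
  out[ok] <- gsub("[[:space:]]{2,}", " ", out[ok], perl = TRUE)
  out
}

x2 <- c(NA, "  abc", "a b c    ", "a  b c")
res <- collapse_ws(x2)
print(res)
```

Since the question shows the plain call working on a vector without NA, the filtering step alone may already be enough; perl = TRUE is kept here only as an extra safeguard.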

2 answers

solution1
2 2014-10-03 11:55:42

solution2
1 2014-10-05 12:54:17
