R: conditional replace/trim with padding (regex, gsub, gregexpr, regmatches)

Question

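I have a problem involving a conditional replacement.

Essentially, I want to find every run of digits and replace each consecutive digit after the 4th with a space.

I need a vectorized solution, and speed is critical.

Here is a working (but inefficient) solution: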

data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))), 
                              stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456    098765  1111   ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456    098765  111111   ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12        098     111   ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c("   12345   67890    ",NA)

x2 <- data[,"input"]
x2

p1 <- "([0-9]+)"

m1 <- gregexpr(p1, x2,perl = TRUE)

nchar1 <- lapply(regmatches(x2, m1), function(x){
  if (length(x)==0){ x <- NA  } else ( x <- nchar(x))
  return(x) })

x3 <- mapply(function(match,length,text,cutoff) {
  temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)

  for(i in which(temp_comb[,"length"] > cutoff))
  {
    before <- substr(text, 1, (temp_comb[i,"match"]-1))
    middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
    middle_space <-  paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
    after <-  substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
    text <- paste(before,middle_4,middle_space,after,sep="")
  }
  return(text)

},match=m1,length=nchar1,text=x2,cutoff=4)

data[,"output"] <- x3

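Is there a better way?

I looked at the help page for regmatches, which has a similar kind of problem, but there the match is replaced entirely with blanks rather than conditionally.

I would write some alternatives and benchmark them, but honestly I can't think of another way to do this.

Thanks in advance for your help!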

Update

Using the gregexpr/regmatches approach from Answer 3, but passing the cutoff as an input, I get an error in the NA case:

# replace digits after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {

    # x <- regmatches(data$input, m)[[4]]
    # cutoff <- 4

    mapply(function(x, n, cutoff){
      formatC(substr(x,1,cutoff), width=-n)
    }, x=x, n=nchar(x),cutoff=cutoff)

},cutoff=4)

Answer 1

Here is a fast approach that needs only a single gsub call:

gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234    0987  1111   "        
# [2] " PADDED STRING WITH 3 FIX(ES): 1234    0987  1111   "
# [3] " STRING WITH 0 FIX(ES): 12        098     111   "    
# [4] NA                                                    
# [5] "1234"                                                
# [6] "   1234   6789    "

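The pattern (?<!\\d) is a negative lookbehind: the position must not be preceded by a digit. The pattern (\\d{4}) matches 4 consecutive digits and captures them as the first group. Finally, \\d* matches any number of further digits. The part of the string matched by the whole regex is replaced by the first group (the first 4 digits).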
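The call above hard-codes the cutoff of 4. If the cutoff needs to be an input, one option is to build the pattern with sprintf. This is only a minimal sketch: the helper name trim_digits and its cutoff argument are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: keep only the first `cutoff` digits of every digit run.
trim_digits <- function(x, cutoff = 4L) {
  # (?<!\d)  not preceded by a digit
  # (\d{n})  capture the first n digits
  # \d*      consume the remaining digits of the run
  pattern <- sprintf("(?<!\\d)(\\d{%d})\\d*", as.integer(cutoff))
  gsub(pattern, "\\1", x, perl = TRUE)
}

trim_digits(c("ab 123456 cd", "12 345", NA))
# [1] "ab 1234 cd" "12 345"     NA
trim_digits("1234567890", cutoff = 2)
# [1] "12"
```

Like the original one-liner, this shortens the string; runs with fewer digits than the cutoff and NA inputs are returned unchanged.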

An approach that does not change the string length:

matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
  # m is NA for a missing input and -1 when nothing matched
  if (!is.na(m[1L]) && m[1L] != -1L) {
    len <- attr(m, "match.length")
    for (i in seq_along(m)) {
      # overwrite the i-th run of surplus digits with spaces of the same width
      substr(d, m[i], m[i] + len[i] - 1L) <- strrep(" ", len[i])
    }
  }
  return(d)
}, matches, data$input)

# [1] "STRING WITH 2 FIX(ES): 1234      0987    1111   "          
# [2] " PADDED STRING WITH 3 FIX(ES): 1234      0987    1111     "
# [3] " STRING WITH 0 FIX(ES): 12        098     111   "          
# [4] NA                                                          
# [5] "1234      "                                                
# [6] "   1234    6789     "

Answer 2

You can do the same thing in one line (one space per digit) with:

gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)

Details:

(?:        # non-capturing group: the two possible entry points
    \G     # either the position after the last match or the start of the string
    (?!\A) # exclude the start of the string position
  |        # OR
    \d{4}  # four digits
)          # close the non-capturing group
\K         # discards everything matched so far from the match result
\d         # a single digit
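As a quick check that this pattern blanks the surplus digits without changing the string length, the one-liner can be wrapped in a small function. This is a sketch under assumptions: the name blank_digits and the cutoff argument are hypothetical and were not in the original answer.

```r
# Hypothetical helper: replace every digit after the first `cutoff`
# digits of a run with a single space, keeping the string length.
blank_digits <- function(x, cutoff = 4L) {
  pattern <- sprintf("(?:\\G(?!\\A)|\\d{%d})\\K\\d", as.integer(cutoff))
  gsub(pattern, " ", x, perl = TRUE)
}

x <- c("1234567890", "   12345   67890    ", "12 098", NA)
out <- blank_digits(x)
out

# The non-missing strings keep their original width.
ok <- !is.na(x)
stopifnot(all(nchar(out[ok]) == nchar(x[ok])))
```

Because each match is exactly one digit and the replacement is exactly one space, no per-match width bookkeeping is needed.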

Answer 3

Here is an approach using gregexpr and regmatches:

# find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)

# replace digits after the 4th with spaces for those matches
zz <- lapply(regmatches(data$input, m), function(x) {
        mapply(function(x, n) formatC(substr(x, 1, 4), width = -n), x, nchar(x))
})

# combine with the unmatched parts of the original values
data$output2 <- unlist(Map(function(a, b) paste0(a, c(b, ""), collapse = ""),
    regmatches(data$input, m, invert = TRUE), zz))

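The difference here is that it converts NA values to "". We could add further checks to prevent this, or simply convert all zero-length strings to missing values at the end. I just didn't want to overcomplicate the code with safety checks.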
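For readers who need the cutoff as an input and want missing values preserved, the following sketch may help. It is an assumption-laden variant rather than the answerer's code: the function name pad_digits is invented, the cutoff is passed through MoreArgs so that mapply does not try to recycle it against a zero-length set of matches, and NA inputs are restored at the end whatever intermediate value they produced.

```r
# Hypothetical wrapper around the gregexpr/regmatches approach.
pad_digits <- function(x, cutoff = 4L) {
  # digit runs longer than `cutoff`
  m <- gregexpr(sprintf("\\d{%d,}", as.integer(cutoff) + 1L), x)
  # truncate each match to `cutoff` digits, left-justified in its original width
  zz <- lapply(regmatches(x, m), function(s) {
    mapply(function(s, n, cutoff) formatC(substr(s, 1, cutoff), width = -n),
           s, nchar(s), MoreArgs = list(cutoff = cutoff))
  })
  # interleave the unmatched pieces with the padded matches
  out <- unlist(Map(function(a, b) paste0(a, c(b, ""), collapse = ""),
                    regmatches(x, m, invert = TRUE), zz))
  out[is.na(x)] <- NA_character_  # restore missing inputs
  out
}

pad_digits(c("ab 123456 cd", "12 345", NA, "1234567890"))
```

Passing cutoff as a regular vectorized argument, as in the question's update, fails when an element has no matches, because a zero-length x is then mixed with a length-one cutoff.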

3 solutions

Solution 1
1 2014-08-06 19:12:31

Solution 2
1 2014-08-06 21:13:38

Solution 3
0 Accepted 2014-08-06 19:36:05
