Replace a data frame column based on a regular expression

Question

I am trying to extract part of a column in a data frame using regular expressions. The problems I am running into are that grep returns the whole value, not just the matched part, and that str_extract doesn't seem to work in a vectorized way.

Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?

Also, how can I replace [a-zA-Z] in an R regex? Can I use a character class or a POSIX code like [:alpha:]?

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match

#TRUE FALSE  TRUE FALSE

v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)

df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA

df

#v1  v2      match
#1   _a.b._  _a.b._
#2   <NA>    <NA>
#3   _C.D._  _C.D._
#4   _ef_    <NA>

What I want:

#v1  v2      match
#1   _a.b._  a.b.
#2   <NA>    <NA>
#3   _C.D._  C.D.
#4   _ef_    <NA>

Answer 1

Four approaches

Here are two approaches in base R, plus one using rm_default(extract = TRUE) from the qdapRegex package (which I maintain) and one using the stringi package.

unlist(sapply(regmatches(df[["v2"]], gregexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df[["v2"]])), function(x){
        ifelse(identical(character(0), x), NA, x)
    })
)

## [1] "a.b." NA     "C.D." NA 

pat <- "(.*?)([a-zA-Z]\\.[a-zA-Z]\\.)(.*?)$"
df[["v2"]][!grepl(pat, df[["v2"]])] <- NA
df[["v2"]] <- gsub(pat, "\\2", df[["v2"]])

## [1] "a.b." NA     "C.D." NA

library(qdapRegex)
unlist(rm_default(df[["v2"]], pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", extract = TRUE))

## [1] "a.b." NA     "C.D." NA 

library(stringi)
stri_extract_first_regex(df[["v2"]], "[a-zA-Z]\\.[a-zA-Z]\\.")

## [1] "a.b." NA     "C.D." NA
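The first base approach above is fairly dense, so here is a minimal sketch of the same idea spelled out in steps. It assumes only base R; the variable names hits and first are illustrative and not part of the original answer. gregexpr finds every occurrence per element, regmatches returns a list with character(0) where nothing matched (including NA inputs), and vapply keeps the first hit or fills in NA.

```r
v2  <- c("_a.b._", NA, "_C.D._", "_ef_")
pat <- "[a-zA-Z]\\.[a-zA-Z]\\."

# One list element per input value; character(0) where there is no match
hits <- regmatches(v2, gregexpr(pat, v2))

# Keep the first match of each element, or NA when the element has none.
# vapply() guarantees a character vector of the same length as v2.
first <- vapply(
  hits,
  function(x) if (length(x) > 0) x[1] else NA_character_,
  character(1)
)

print(first)
```

Using vapply with character(1) rather than sapply plus unlist keeps the result type-stable, so it can be assigned straight to a column such as df$match.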

Answer 2

Base R solution using regmatches and regexpr, which returns -1 if no regex match is found:

r <- regexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match <- NA
df$match[which(r != -1)] <- regmatches(df$v2, r)

#  v1     v2 match
#1  1 _a.b._  a.b.
#2  2   <NA>  <NA>
#3  3 _C.D._  C.D.
#4  4   _ef_  <NA>
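On the second part of the question: R's regex engines accept POSIX character classes, but they have to be written inside a bracket expression, so [[:alpha:]] rather than [:alpha:]. The sketch below wraps the regexpr/regmatches idea in a helper and uses that class. The name extract_first is hypothetical, chosen for illustration. Note that what [[:alpha:]] matches depends on the locale, so it is not always identical to [a-zA-Z].

```r
# Hypothetical helper: return the first match of `pattern` in each element
# of `x`, or NA where there is no match (or where the input itself is NA).
extract_first <- function(x, pattern) {
  m   <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1   # regexpr() gives NA for NA input, -1 for no match
  out[hit] <- regmatches(x, m) # regmatches() returns only the matched elements
  out
}

v2  <- c("_a.b._", NA, "_C.D._", "_ef_")
res <- extract_first(v2, "[[:alpha:]]\\.[[:alpha:]]\\.")
print(res)
```

The result can then be stored with df$match <- extract_first(df$v2, "[[:alpha:]]\\.[[:alpha:]]\\.").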

Answer 3

One possible solution using both grepl and sub:

# First, remove unwanted characters around pattern when detected
df$match <- sub(pattern = ".*([a-zA-Z]\\.[a-zA-Z]\\.).*", 
                replacement = "\\1", x = df$v2)
# Second, check if pattern is present; otherwise set to NA
df$match <- ifelse(grepl(pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", x = df$match),
                   yes = df$match, no = NA)

Results

df

#   v1     v2 match
# 1  1 _a.b._  a.b.
# 2  2   <NA>  <NA>
# 3  3 _C.D._  C.D.
# 4  4   _ef_  <NA>
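Two details of the sub approach are worth illustrating. First, sub returns its input unchanged when the pattern does not match, which is why the grepl check is needed. Second, the leading .* is greedy, so if a value contains more than one occurrence the capture group ends up on the last one, whereas regexpr reports the first. The following is a small sketch; the two-occurrence input string is made up for illustration and does not appear in the original data.

```r
pat  <- "[a-zA-Z]\\.[a-zA-Z]\\."
full <- paste0(".*(", pat, ").*")

x <- "_a.b._c.d._"  # hypothetical value containing two occurrences

# The greedy leading .* consumes as much as possible, so the group
# captures the last occurrence.
last_hit <- sub(full, "\\1", x)

# regexpr() reports the first occurrence instead.
first_hit <- regmatches(x, regexpr(pat, x))

# With no match, sub() hands back the original string untouched.
no_hit <- sub(full, "\\1", "_ef_")

print(c(last_hit, first_hit, no_hit))
```

For the sample data in the question every value has at most one occurrence, so both approaches give the same result there.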

Data

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

Answer 1: score 4, accepted, posted 2015-04-09 03:14:04
Answer 2: score 4, posted 2015-04-09 03:40:43
Answer 3: score 3, posted 2015-04-09 03:21:16