
Replace a data frame column based on regex

I am trying to extract part of a column in a data frame using regular expressions. The problems I am running into are that grep returns the whole value, not just the matched part, and that str_extract doesn't seem to work in a vectorized way.

Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?

Also, how can I replace [a-zA-Z] in R regex? Can I use a character class or a POSIX code like [:alpha:]?

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)

df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match

#TRUE FALSE  TRUE FALSE

v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)

df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA

df

#v1  v2      match
#1   _a.b._  _a.b._
#2   <NA>    <NA>
#3   _C.D._  _C.D._
#4   _ef_    <NA>

What I want:

#v1  v2      match
#1   _a.b._  a.b.
#2   <NA>    <NA>
#3   _C.D._  C.D.
#4   _ef_    <NA>

4 Approaches...

Here are 2 approaches in base R, plus one using rm_default(extract = TRUE) from the qdapRegex package (which I maintain) and one using the stringi package.

unlist(sapply(regmatches(df[["v2"]], gregexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df[["v2"]])), function(x){
        ifelse(identical(character(0), x), NA, x)
    })
)

## [1] "a.b." NA     "C.D." NA 

pat <- "(.*?)([a-zA-Z]\\.[a-zA-Z]\\.)(.*?)$"
out <- df[["v2"]]                # work on a copy so df$v2 is left intact
out[!grepl(pat, out)] <- NA      # no match (or NA input) becomes NA
gsub(pat, "\\2", out)            # keep only the second capture group

## [1] "a.b." NA     "C.D." NA

library(qdapRegex)
unlist(rm_default(df[["v2"]], pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", extract = TRUE))

## [1] "a.b." NA     "C.D." NA 

library(stringi)
stri_extract_first_regex(df[["v2"]], "[a-zA-Z]\\.[a-zA-Z]\\.")

## [1] "a.b." NA     "C.D." NA 
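
On the side question about replacing [a-zA-Z]: R's regex engines accept POSIX classes, but only inside a bracket expression, so the class is written [[:alpha:]] rather than [:alpha:]. A minimal sketch using the question's data (the names pat_posix and res are only for illustration). Note that [[:alpha:]] is locale-dependent and may match more than the ASCII letters:

```r
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")

# A POSIX class must sit inside a bracket expression: [[:alpha:]], not [:alpha:]
pat_posix <- "[[:alpha:]]\\.[[:alpha:]]\\."

r <- regexpr(pat_posix, v2)               # -1 for no match, NA for NA input
res <- rep(NA_character_, length(v2))     # start with all NA
res[which(r != -1)] <- regmatches(v2, r)  # fill in only the matched elements
res
```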

Base R solution using regmatches and regexpr, which returns -1 if no regex match is found:

r <- regexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match <- NA
df$match[which(r != -1)] <- regmatches(df$v2, r)

#  v1     v2 match
#1  1 _a.b._  a.b.
#2  2   <NA>  <NA>
#3  3 _C.D._  C.D.
#4  4   _ef_  <NA>
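
The same idea can be wrapped in a small helper so it can be reused on any character column. This is only a sketch: extract_first is a hypothetical name, not a function from base R or any package.

```r
# Hypothetical helper: first regex match per element, NA where nothing matches
extract_first <- function(x, pattern) {
  r <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(r) & r != -1       # NA inputs give NA in r; treat them as no match
  out[hit] <- regmatches(x, r)     # regmatches() returns the matched elements only
  out
}

df <- data.frame(v1 = 1:4,
                 v2 = c("_a.b._", NA, "_C.D._", "_ef_"),
                 stringsAsFactors = FALSE)
df$match <- extract_first(df$v2, "[a-zA-Z]\\.[a-zA-Z]\\.")
df
```

Initializing the result with NA_character_ keeps the column of type character even when nothing matches.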

One possible solution using both grepl and sub:

# First, remove unwanted characters around pattern when detected
df$match <- sub(pattern = ".*([a-zA-Z]\\.[a-zA-Z]\\.).*", 
                replacement = "\\1", x = df$v2)
# Second, check if pattern is present; otherwise set to NA
df$match <- ifelse(grepl(pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", x = df$match),
                   yes = df$match, no = NA)

Results

df

#   v1     v2 match
# 1  1 _a.b._  a.b.
# 2  2   <NA>  <NA>
# 3  3 _C.D._  C.D.
# 4  4   _ef_  <NA>

Data

v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)
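
One caveat worth checking with the sub approach: because the leading .* is greedy, a value that contains several matches should yield the last one, whereas regexpr reports the first. A small sketch with a made-up input string (x is hypothetical and not part of the question's data):

```r
pat <- "[a-zA-Z]\\.[a-zA-Z]\\."
x <- "_a.b._ and _C.D._"   # made-up value containing two matches

# The greedy leading .* consumes as much as possible, so the capture is the last match
last_match <- sub(paste0(".*(", pat, ").*"), "\\1", x)

# regexpr() locates the leftmost match
first_match <- regmatches(x, regexpr(pat, x))

c(first = first_match, last = last_match)
```

If the first match is wanted from sub, a lazy prefix such as (.*?) is one option to try.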
