Replace a data frame column based on regex
I am trying to extract part of a column in a data frame using regular expressions. The problems I am running into are that grep returns the whole value, not just the matched part, and that str_extract doesn't seem to work in a vectorized way.
Here is what I'm trying. I would like df$match to show alpha.alpha. where the pattern exists and NA otherwise. How can I show only the matched part?
Also, how can I replace [a-zA-Z] in R regex? Can I use a character class or a POSIX code like [:alpha:]?
v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)
df$match <- grepl("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match
#TRUE FALSE TRUE FALSE
v2grep <- grep("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2, value = TRUE)
df$match[df$match == TRUE] <- v2grep
df$match[df$match == FALSE] <- NA
df
#v1 v2 match
#1 _a.b._ _a.b._
#2 <NA> <NA>
#3 _C.D._ _C.D._
#4 _ef_ <NA>
What I want:
#v1 v2 match
#1 _a.b._ a.b.
#2 <NA> <NA>
#3 _C.D._ C.D.
#4 _ef_ <NA>
4 Approaches...
Here are two approaches in base R, plus one using rm_default(extract = TRUE) from the qdapRegex package I maintain and one using the stringi package.
# regmatches() gives character(0) where nothing matched; map that to NA
unlist(sapply(
  regmatches(df[["v2"]], gregexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df[["v2"]])),
  function(x) {
    ifelse(identical(character(0), x), NA, x)
  }
))
## [1] "a.b." NA "C.D." NA
pat <- "(.*?)([a-zA-Z]\\.[a-zA-Z]\\.)(.*?)$"
df[["v2"]][!grepl(pat, df[["v2"]])] <- NA
df[["v2"]] <- gsub(pat, "\\2", df[["v2"]])
## [1] "a.b." NA "C.D." NA
library(qdapRegex)
unlist(rm_default(df[["v2"]], pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", extract = TRUE))
## [1] "a.b." NA "C.D." NA
library(stringi)
stri_extract_first_regex(df[["v2"]], "[a-zA-Z]\\.[a-zA-Z]\\.")
## [1] "a.b." NA "C.D." NA
Base R solution using regmatches and regexpr, which returns -1 if no regex match is found:
r <- regexpr("[a-zA-Z]\\.[a-zA-Z]\\.", df$v2)
df$match <- NA
df$match[which(r != -1)] <- regmatches(df$v2, r)
# v1 v2 match
#1 1 _a.b._ a.b.
#2 2 <NA> <NA>
#3 3 _C.D._ C.D.
#4 4 _ef_ <NA>
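On the second part of the question: in R a POSIX class has to sit inside a bracket expression, so the counterpart of [a-zA-Z] is [[:alpha:]] (what counts as a letter then depends on the locale). A minimal sketch of the regexpr/regmatches idea above using that class; the helper name extract_first is hypothetical and not part of the original answers:

```r
# Hypothetical helper (not from the original answers): return the first
# regex match per element, or NA where there is no match or the input is NA.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)              # -1 for no match, NA for NA input
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1
  out[hit] <- regmatches(x, m)          # regmatches() returns only the hits
  out
}

v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
# [[:alpha:]] is the POSIX class placed inside a bracket expression
res <- extract_first(v2, "[[:alpha:]]\\.[[:alpha:]]\\.")
print(res)
```

With the sample data this should give the same column as the [a-zA-Z] version, since all the letters involved are plain ASCII.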
One possible solution using both grepl and sub:
# First, remove unwanted characters around pattern when detected
df$match <- sub(pattern = ".*([a-zA-Z]\\.[a-zA-Z]\\.).*",
replacement = "\\1", x = df$v2)
# Second, check if pattern is present; otherwise set to NA
df$match <- ifelse(grepl(pattern = "[a-zA-Z]\\.[a-zA-Z]\\.", x = df$match),
yes = df$match, no = NA)
Results
df
# v1 v2 match
# 1 1 _a.b._ a.b.
# 2 2 <NA> <NA>
# 3 3 _C.D._ C.D.
# 4 4 _ef_ <NA>
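To see why the second step is needed, it may help to look at the intermediate value: sub leaves a string untouched when the pattern does not match, so "_ef_" would survive without the grepl check. A small sketch on the same sample data (the names pat, step1 and step2 are only for illustration):

```r
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
pat <- "[a-zA-Z]\\.[a-zA-Z]\\."

# sub() alone: the non-matching "_ef_" passes through unchanged
step1 <- sub(paste0(".*(", pat, ").*"), "\\1", v2)
print(step1)

# grepl() is FALSE for non-matching and for NA input, so ifelse()
# turns both kinds of element into NA
step2 <- ifelse(grepl(pat, step1), step1, NA)
print(step2)
```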
Data
v1 <- c(1:4)
v2 <- c("_a.b._", NA, "_C.D._", "_ef_")
df <- data.frame(v1, v2, stringsAsFactors = FALSE)