I'm trying to extract the letter R or O when it stands alone, across multiple columns. By standalone, I mean an R or O that is (i) separated from other characters by spaces or (ii) the only value in a cell. Here's a reproducible example. Suppose I want to extract the standalone R or O from columns X1 and X2.
df <- data.frame(X1 = c( "EHO", "X 1 R","R"), X2 = c( "Y R E", "X A 1", "AER"), X3 = NA)
Here's desired outcome.
data.frame(X1 = c("", "R", "R"), X2 = c("R", "", ""))
Here's what I've tried so far. The first approach is problematic because the R from "AER" and the O from "EHO" are extracted, while the R from "Y R E" is not.
require(stringr)
sapply(df[,1:2], function(x) ifelse( df$X3 %in% NA, str_extract(x, "\\s?[O|R]$"), X3))
So I tried the following, which solves the problem above but now fails to extract the R from df[3,1].
sapply(df[,1:2], function(x) ifelse( df$X3 %in% NA, str_extract(x, "(?![A-Z]+?)\\s?[O|R]$?"), X3))
How should I fix the pattern to get this?
You can use word boundaries:
sapply(df, stringr::str_extract, '\\b[RO]\\b')
#     X1  X2  X3
#[1,] NA  "R" NA
#[2,] "R" NA  NA
#[3,] "R" NA  NA
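The result above uses NA where nothing matched and includes X3, whereas the desired outcome in the question uses empty strings and only X1 and X2. A minimal sketch of one way to get there, assuming stringr is installed; the names `res` and `out` are made up for illustration:

```r
library(stringr)

df <- data.frame(X1 = c("EHO", "X 1 R", "R"),
                 X2 = c("Y R E", "X A 1", "AER"),
                 X3 = NA)

# Extract the first standalone R or O from X1 and X2 only;
# sapply returns a character matrix with one column per input column
res <- sapply(df[, c("X1", "X2")], str_extract, "\\b[RO]\\b")

# str_extract returns NA where nothing matched; swap those for ""
res[is.na(res)] <- ""

# Convert the matrix back to a data frame (hypothetical name `out`)
out <- as.data.frame(res, stringsAsFactors = FALSE)
```

Note that `\\b` treats digits and underscores as word characters too, so an R in something like "R1" would not count as standalone, while an R next to punctuation such as "R-X" would.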
However, note that str_extract will extract only one of "R" or "O", whichever comes first.
stringr::str_extract('EH R O', '\\b[RO]\\b')
#[1] "R"
If you want to extract both of them, you might need to use str_extract_all.
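As a rough sketch of that approach: str_extract_all returns a list with one character vector per input string, so the matches usually need to be collapsed before they fit back into a column. The sample vector `x` and the space separator below are assumptions chosen for illustration:

```r
library(stringr)

# Hypothetical sample input: two standalone letters, none, and one
x <- c("EH R O", "EHO", "R")

# One character vector of matches per element of x
all_matches <- str_extract_all(x, "\\b[RO]\\b")

# Collapse each vector into a single space-separated string;
# elements with no match give character(0), which collapses to ""
collapsed <- vapply(all_matches, paste, character(1), collapse = " ")
collapsed
```

Collapsing with `paste` is only one option; keeping the list, or passing `simplify = TRUE` to get a character matrix, may suit other layouts better.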