I have some strings like the ones below, and I need to extract the color part from each of them.
s1= 'color: red greenSize: 2 CountVerified Purchase'
s2= 'color: red greenVerified Purchase'
s3= 'color: red greenSize: 2 Count'
s4= 'color: red green'
I used str_replace as shown below. It only works for s1 and s3, not for s2 and s4.
str_replace(s1, 'color:\\s(.*)Size:\\s.*', '\\1')
Does anyone know how I can extract the colors from the strings in a way that works for ALL 4 cases?
Here is my attempt using regmatches, along with the following regex pattern:
color: (\\S+) (\\S+)(?=Size|Verified|$)
This isolates the first and second colors. The end of the second color is marked by either the word Size or Verified, or by the end of the string. The lookahead (?=...) requires perl=TRUE.
x <- c("color: red greenSize: 2 CountVerified Purchase",
"color: red greenVerified Purchase",
"color: red greenSize: 2 Count",
"color: red green")
sapply(x, function(x) {
result <- regmatches(x, regexec("color: (\\S+) (\\S+)(?=Size|Verified|$)", x, perl=TRUE))[[1]]
c(result[2], result[3])
})
This outputs (a bit messy):
color: red greenSize: 2 CountVerified Purchase
[1,] "red"
[2,] "green"
color: red greenVerified Purchase color: red greenSize: 2 Count
[1,] "red" "red"
[2,] "green" "green"
color: red green
[1,] "red"
[2,] "green"
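For a tidier result, the same pattern can be applied to the whole vector at once, since regexec is vectorised over its input. The following is only a sketch: it assumes every string matches the pattern, and the names m and colors are just illustrative.

```r
x <- c("color: red greenSize: 2 CountVerified Purchase",
       "color: red greenVerified Purchase",
       "color: red greenSize: 2 Count",
       "color: red green")

# regmatches() returns one character vector per input string:
# the full match first, then the two capture groups.
m <- regmatches(x, regexec("color: (\\S+) (\\S+)(?=Size|Verified|$)", x, perl = TRUE))

# Keep only the capture groups; the result has one row per string
# and one column per color (assumes every string matched).
colors <- t(vapply(m, function(r) r[2:3], character(2)))
colors
```

Each row of colors then holds the first and second color of the corresponding string, without the long column headers.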
Is it just me or are all those colors in lowercase? If this happens to be the case, you could simply do:
pattern <- "color:\\s*([a-z ]+).*"
gsub(pattern, "\\1", your_strings_here)
See a demo on regex101.com.
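As a quick check, here is that pattern applied to the four strings from the question. This is a sketch that relies on the assumption stated above: the colors are lowercase and whatever follows them (Size, Verified) starts with an uppercase letter. The trimws call is an extra precaution, not part of the original suggestion.

```r
x <- c("color: red greenSize: 2 CountVerified Purchase",
       "color: red greenVerified Purchase",
       "color: red greenSize: 2 Count",
       "color: red green")

# [a-z ]+ captures lowercase letters and spaces, so the capture stops
# at the first uppercase letter (the "S" of Size or the "V" of Verified).
pattern <- "color:\\s*([a-z ]+).*"

# trimws() removes any stray trailing space left in the capture.
colors <- trimws(gsub(pattern, "\\1", x))
colors
```

All four strings give "red green". If you need the individual colors, strsplit(colors, " ") splits each result on the space.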