I have a dataframe called mydf, simplified as below:
mydf
var1 var2
abc_color1_location1_number1 1000
xyz_color1_location1_number1 100
asd_color2_location2_number1 900
qwe_color1_location1_number2 200
sdf_color2_location1_number2 1100
qwerrrr_ahjkkk_asdfgggg 234
sdf_color1_location2_number1 3577
abc_color1_location3_number1 86544
I want to subset the dataset flexibly based on var1. For example:
pattern <- c("abc", "color1", "number1")
newmydf <- mydf[grep(paste("_",paste(pattern,collapse="_|_"),"_",sep=""),mydf$var1,ignore.case=T),]
My expected result:
newmydf
var1 var2
abc_color1_location1_number1 1000
However, the resulting data frame was subset using only the patterns "abc" and "color1", whereas I want all of the patterns to be considered. Can anyone please help me with this?
Many thanks in advance!
With kind regards,
If you want all the elements of pattern to be considered, then something like this might help:
pattern <- c("abc", "color1", "number1")
alltrue <- rowSums(sapply(pattern, function(x) grepl(pattern = x, mydf$var1))) == length(pattern)
mydf[alltrue, ]
# var1 var2
#1 abc_color1_location1_number1 1000
#8 abc_color1_location3_number1 86544
Essentially, sapply will run grepl for each one of the pattern elements, and then only the rows where all the grepl results are TRUE are used.
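The same "every element must match" idea can also be written so that the logical vectors are combined directly with AND instead of being counted. This is only a sketch in base R: the data frame is rebuilt inline so the snippet runs on its own, and fixed = TRUE is my assumption that the pattern elements are plain text rather than regular expressions.

```r
# Self-contained copy of the example data from the question
mydf <- data.frame(
  var1 = c("abc_color1_location1_number1", "xyz_color1_location1_number1",
           "asd_color2_location2_number1", "qwe_color1_location1_number2",
           "sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg",
           "sdf_color1_location2_number1", "abc_color1_location3_number1"),
  var2 = c(1000, 100, 900, 200, 1100, 234, 3577, 86544),
  stringsAsFactors = FALSE
)
pattern <- c("abc", "color1", "number1")

# One logical vector per pattern element (TRUE where var1 contains it).
# fixed = TRUE treats each element as a literal string (an assumption here).
hits <- lapply(pattern, grepl, x = mydf$var1, fixed = TRUE)

# Element-wise AND across the list: a row survives only if every element matched
alltrue <- Reduce(`&`, hits)

mydf[alltrue, ]
```

Because nothing depends on the number of elements, the same code should keep working if pattern grows or shrinks.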
A solution using tidyverse and stringr. mydf2 is the final output.
find_match is a user-defined function that returns a logical vector (TRUE or FALSE for each row) indicating whether all the words in pattern are found.
By applying the find_match function, we can filter the data frame based on the results.
library(tidyverse)
library(stringr)
find_match <- function(Col, pattern){
m <- map(pattern, str_detect, string = Col)
names(m) <- paste("Word", pattern)
m2 <- as_tibble(m)  # as_data_frame() is deprecated in current tibble versions
results <- rowSums(m2) == length(pattern)
return(results)
}
mydf2 <- mydf %>% filter(find_match(var1, pattern))
mydf2
var1 var2
1 abc_color1_location1_number1 1000
2 abc_color1_location3_number1 86544
# Create mydf
mydf <- read.table(text = "var1 var2
abc_color1_location1_number1 1000
xyz_color1_location1_number1 100
asd_color2_location2_number1 900
qwe_color1_location1_number2 200
sdf_color2_location1_number2 1100
qwerrrr_ahjkkk_asdfgggg 234
sdf_color1_location2_number1 3577
abc_color1_location3_number1 86544",
header = TRUE, stringsAsFactors = FALSE)
# Define the pattern
pattern <- c("abc", "color1", "number1")
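Since the question tried to do this with a single regular expression, it may be worth noting that an AND across patterns can be expressed in one regex by chaining lookaheads. The sketch below uses base grepl with perl = TRUE so that it runs without extra packages; the resulting logical vector could equally be passed to filter(). It assumes the pattern elements contain no regex metacharacters, and the name lookahead is just an illustrative variable.

```r
# Example values of var1 from the question
var1 <- c("abc_color1_location1_number1", "xyz_color1_location1_number1",
          "asd_color2_location2_number1", "qwe_color1_location1_number2",
          "sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg",
          "sdf_color1_location2_number1", "abc_color1_location3_number1")
pattern <- c("abc", "color1", "number1")

# Build "^(?=.*abc)(?=.*color1)(?=.*number1)":
# each (?=.*word) asserts that word occurs somewhere in the string
lookahead <- paste0("^", paste0("(?=.*", pattern, ")", collapse = ""))

# Lookaheads need the PCRE engine, hence perl = TRUE
keep <- grepl(lookahead, var1, perl = TRUE)

var1[keep]
```

By contrast, joining the words with "|" as in the question builds an alternation, which matches a row as soon as any one of the words is present.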
An alternative approach is to strsplit on _ and use all(... %in% ...):
keep <- sapply(strsplit(df$var1, "_"), function(x) all(pattern %in% x))
df[keep,]
Output
var1 var2
1 abc_color1_location1_number1 1000
8 abc_color1_location3_number1 86544
Data
df <- structure(list(var1 = c("abc_color1_location1_number1", "xyz_color1_location1_number1",
"asd_color2_location2_number1", "qwe_color1_location1_number2",
"sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg", "sdf_color1_location2_number1",
"abc_color1_location3_number1"), var2 = c(1000L, 100L, 900L,
200L, 1100L, 234L, 3577L, 86544L)), .Names = c("var1", "var2"
), class = "data.frame", row.names = c(NA, -8L))
pattern <- c("abc", "color1", "number1")
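One practical difference between splitting on _ and the grepl-based answers is that %in% compares whole tokens, while grepl also matches substrings. The strings below are hypothetical (they are not in the original data) and are only meant to illustrate that difference.

```r
x <- c("abc_color1_location1_number1",   # contains all three tokens
       "abc_color10_location1_number1",  # hypothetical: "color10", not "color1"
       "xabc_color1_location1_number1")  # hypothetical: "xabc", not "abc"
pattern <- c("abc", "color1", "number1")

# Substring matching: "color1" is found inside "color10", "abc" inside "xabc"
substring_match <- Reduce(`&`, lapply(pattern, grepl, x = x, fixed = TRUE))

# Token matching: every pattern element must equal one of the "_"-separated pieces
token_match <- vapply(strsplit(x, "_", fixed = TRUE),
                      function(tok) all(pattern %in% tok),
                      logical(1))

substring_match  # TRUE TRUE TRUE
token_match      # TRUE FALSE FALSE
```

Which behaviour is wanted depends on the data; for the example in the question both give the same two rows.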