Flexible patterns for a factor variable in order to subset a data frame

I have a dataframe called mydf, simplified as below:

mydf

var1                          var2
abc_color1_location1_number1  1000
xyz_color1_location1_number1  100
asd_color2_location2_number1  900
qwe_color1_location1_number2  200
sdf_color2_location1_number2  1100
qwerrrr_ahjkkk_asdfgggg       234  
sdf_color1_location2_number1  3577
abc_color1_location3_number1  86544

I want to subset the dataset flexibly based on var1. For example:

pattern <- c("abc", "color1", "number1")
newmydf <- mydf[grep(paste("_",paste(pattern,collapse="_|_"),"_",sep=""),mydf$var1,ignore.case=T),]

My expected result:

newmydf
var1                          var2
abc_color1_location1_number1  1000

However, the resulting data frame was subset using only the patterns "abc" and "color1", whereas I want all of the patterns to be considered. Can anyone please help me with this?

Many thanks in advance!

With kind regards,

If you want all the elements of pattern to be considered, then something like this might help:

pattern <- c("abc", "color1", "number1")
# TRUE only for rows where every element of pattern is found in var1
alltrue <- rowSums(sapply(pattern, function(x) grepl(pattern = x, mydf$var1))) == length(pattern)

mydf[alltrue, ]
#                          var1  var2
#1 abc_color1_location1_number1  1000
#8 abc_color1_location3_number1 86544

Essentially, sapply runs grepl once for each element of pattern, producing a logical matrix with one column per element; rowSums then keeps only the rows where every grepl result is TRUE.
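As for why the original attempt did not work: `|` in a regular expression means "or", so a row is kept as soon as any single alternative matches. As a rough sketch, the "all of them" requirement can also be expressed in one regular expression by chaining lookaheads, which requires perl = TRUE. The names and_regex and hits are only illustrative, and the sketch assumes the elements of pattern contain no regex metacharacters.

```r
mydf <- data.frame(
  var1 = c("abc_color1_location1_number1", "xyz_color1_location1_number1",
           "asd_color2_location2_number1", "qwe_color1_location1_number2",
           "sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg",
           "sdf_color1_location2_number1", "abc_color1_location3_number1"),
  var2 = c(1000, 100, 900, 200, 1100, 234, 3577, 86544),
  stringsAsFactors = FALSE
)
pattern <- c("abc", "color1", "number1")

# One lookahead per element: "^(?=.*abc)(?=.*color1)(?=.*number1)"
and_regex <- paste0("^", paste0("(?=.*", pattern, ")", collapse = ""))

# Lookaheads are a PCRE feature, hence perl = TRUE
hits <- grepl(and_regex, mydf$var1, perl = TRUE)
mydf[hits, ]
```

This keeps the single-grep style of the question, at the cost of a less readable pattern than the sapply version.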

Here is a solution using tidyverse and stringr. mydf2 is the final output.

find_match is a user-defined function that returns a logical vector (TRUE or FALSE for each row) indicating whether all the words in pattern were found.

By applying the find_match function, we can filter the data frame based on the results.

library(tidyverse)
library(stringr)

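find_match <- function(Col, pattern){
  # One logical vector per word: does Col contain that word?
  m <- map(pattern, str_detect, string = Col)
  names(m) <- paste("Word", pattern)
  # as_tibble() replaces the deprecated as_data_frame()
  m2 <- as_tibble(m)
  # Keep a row only if every word was detected
  results <- rowSums(m2) == length(pattern)
  return(results)
}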

mydf2 <- mydf %>% filter(find_match(var1, pattern))
mydf2
                          var1  var2
1 abc_color1_location1_number1  1000
2 abc_color1_location3_number1 86544

Data Preparation

# Create mydf
mydf <- read.table(text = "var1                          var2
abc_color1_location1_number1  1000
                   xyz_color1_location1_number1  100
                   asd_color2_location2_number1  900
                   qwe_color1_location1_number2  200
                   sdf_color2_location1_number2  1100
                   qwerrrr_ahjkkk_asdfgggg       234  
                   sdf_color1_location2_number1  3577
                   abc_color1_location3_number1  86544",
                   header = TRUE, stringsAsFactors = FALSE)

# Define the pattern
pattern <- c("abc", "color1", "number1")

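An alternative approach is to strsplit on _ and use all(... %in% ...), so that each row is kept only if every element of pattern appears among its underscore-separated pieces.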

keep <- sapply(strsplit(df$var1, "_"), function(x) all(pattern %in% x))
df[keep,]

Output

                          var1  var2
1 abc_color1_location1_number1  1000
8 abc_color1_location3_number1 86544
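One difference from the grepl-based answers is worth noting: %in% compares whole underscore-separated tokens, whereas grepl also matches substrings. Here is a small sketch with made-up values (the vector x and the token color10 are hypothetical and not part of the question's data):

```r
pattern <- c("abc", "color1", "number1")
x <- c("abc_color1_location1_number1",   # all three tokens present
       "abc_color10_location1_number1")  # "color10" merely contains "color1"

# Substring matching: both elements pass
substring_hit <- Reduce(`&`, lapply(pattern, grepl, x = x, fixed = TRUE))

# Whole-token matching: only the first element passes
token_hit <- sapply(strsplit(x, "_", fixed = TRUE),
                    function(tok) all(pattern %in% tok))
```

Which behaviour is preferable depends on whether partial matches such as color10 should count.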

Data

df <- structure(list(var1 = c("abc_color1_location1_number1", "xyz_color1_location1_number1", 
"asd_color2_location2_number1", "qwe_color1_location1_number2", 
"sdf_color2_location1_number2", "qwerrrr_ahjkkk_asdfgggg", "sdf_color1_location2_number1", 
"abc_color1_location3_number1"), var2 = c(1000L, 100L, 900L, 
200L, 1100L, 234L, 3577L, 86544L)), .Names = c("var1", "var2"
), class = "data.frame", row.names = c(NA, -8L))

pattern <- c("abc", "color1", "number1")
