I have the following data frame containing characters, numbers, and NA values:
df <- data.frame(a = c("notfound", "NOT FOUND", "NOT FOUND"),
                 b = c(NA, "NOT FOUND", "NOT FOUND"),
                 c = c("not found", 2, 3),
                 d = c("not found", "NOT FOUND", "NOT FOUND"),
                 e = c("234", "NOT FOUND", NA))
          a         b         c         d         e
1  notfound      <NA> not found not found       234
2 NOT FOUND NOT FOUND         2 NOT FOUND NOT FOUND
3 NOT FOUND NOT FOUND         3 NOT FOUND      <NA>
I would like to remove every column where all the entries are "not found", "NOT found", "NOT FOUND", or "notfound" — basically where tolower(gsub(" ", "", df)) == "notfound". It seems this operation does not work on data frames. Are there any alternatives?
The desired output would be:
          c         e
1 not found       234
2         2 NOT FOUND
3         3      <NA>
You can use grepl with a regular expression to search for strings matching that expression, and keep only the columns where at least one element does not match (indicated by FALSE in the grepl output), i.e. where the number of matches for that column is less than nrow(df). The pattern below matches strings that start with "not" and end with "found", and grepl is set to be case-insensitive.
is_nf <- sapply(df, grepl, pattern = '(?=^not).*found$',
                perl = TRUE, ignore.case = TRUE)
df[colSums(is_nf) < nrow(df)]
# b c e
# 1 <NA> not found 234
# 2 NOT FOUND 2 NOT FOUND
# 3 NOT FOUND 3 <NA>
I'm guessing you'd also want to remove columns where the only entries that are not "not found" are NA:
is_na <- is.na(df)
df[colSums(is_nf | is_na) < nrow(df)]
# c e
# 1 not found 234
# 2 2 NOT FOUND
# 3 3 <NA>
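As an alternative, the same idea can be expressed with Filter(), which keeps the columns for which a predicate returns TRUE. This is a sketch assuming the simpler pattern '^not ?found$' (an optional single space) is sufficient for your data, and treating NA like a "not found" entry as above:

```r
df <- data.frame(a = c("notfound", "NOT FOUND", "NOT FOUND"),
                 b = c(NA, "NOT FOUND", "NOT FOUND"),
                 c = c("not found", 2, 3),
                 d = c("not found", "NOT FOUND", "NOT FOUND"),
                 e = c("234", "NOT FOUND", NA))

# Keep a column only if some entry is neither a "not found" variant nor NA
keep <- function(x) {
  !all(grepl('^not ?found$', x, ignore.case = TRUE) | is.na(x))
}

Filter(keep, df)  # leaves only columns c and e
```

Filter() avoids building the full logical matrix, at the cost of recomputing the predicate per column; for a handful of columns the difference is negligible.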