
How do I filter rows on a subset of columns?

Tail of df:

          fruit Letter Points     A    B     C       D
16       cherry      P   7876 11.43 7.23 13.72 4.29.01
17 chili pepper      Q   7831 10.85 7.18 14.14 4.33.90
18   clementine      R   7827 11.07 7.24 14.19 4.56.52
19   cloudberry      S   7704 10.38 7.73 14.32       X
20      coconut      T   7634 11.21 7.12 13.25 4.57.92
21    cranberry      U   7346 10.88 6.65 13.80 4.32.50

This seems like a common question, but all the answers I've seen filter either on one column or on all columns. Here, I want to remove rows that contain "X" in columns A to D only.

Based on previous answers, if I wanted to only filter on one column, I can do:

df <- df[!grepl("X", df$D),]

Which works fine, but I can only do this manually as I know a priori where the "X" is. As I want to filter on many dfs of the same format, I need a way to filter on columns A to D.

Intuitively I figured I could just expand the argument in grepl to include the columns I want to filter on:

df <- df[!grepl("X", df[,c("A","B","C","D")]),]

or

df1 <- df1[!grepl("X", df1[,4:7]),]

However, this ends up removing rows which don't contain an "X", or any letter at all, in columns A to D. I'm guessing this is because the grep family of functions doesn't accept multiple vectors?

Ideally I'd like a base solution as I'm stumped at something which should be easy to figure out.

Full df:

df <- structure(list(fruit = c("apple", "apricot", "avocado", "bell pepper", 
"bilberry", "blackberry", "blood orange", "blueberry", "boysenberry", 
"canary melon", "cantaloupe", "cherimoya", "chili pepper", "clementine", 
"cloudberry", "cranberry"), Letter = c("A", "B", "C", "E", "F", 
"G", "I", "J", "K", "M", "N", "O", "Q", "R", "S", "U"), Points = c(8900, 
8757, 8742, 8554, 8531, 8461, 8206, 8153, 8113, 8106, 8050, 8017, 
7831, 7827, 7704, 7346), A = c("10.54", "10.64", "10.69", "10.64", 
"10.76", "10.99", "10.81", "11.00", "10.84", "11.05", "10.72", 
"10.84", "10.85", "11.07", "10.38", "10.88"), B = c("8.03", "7.88", 
"7.78", "7.24", "7.92", "7.59", "7.68", "7.32", "7.37", "7.34", 
"7.18", "6.89", "7.18", "7.24", "7.73", "6.65"), C = c("16.68", 
"15.19", "14.14", "15.72", "14.50", "14.75", "15.64", "14.19", 
"15.09", "15.10", "14.66", "14.20", "14.14", "14.19", "14.32", 
"13.80"), D = c("4.42.33", "4.35.06", "4.35.59", "4.23.13", "4.23.23", 
"4.29.93", "4.48.64", "4.21.06", "4.30.12", "4.52.35", "5.00.38", 
"4.48.11", "4.33.90", "4.56.52", "X", "4.32.50")), row.names = c(1L, 
2L, 3L, 5L, 6L, 7L, 9L, 10L, 11L, 13L, 14L, 15L, 17L, 18L, 19L, 
21L), class = "data.frame")
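On the guess in the question about the grep family: as far as I can tell, grepl() does accept a data frame, but it coerces it with as.character(), which yields one deparsed string per column rather than one value per row. The result is then a logical vector as long as the number of columns, and R recycles it down the rows when subsetting. A minimal sketch on a made-up two-column data frame (toy is hypothetical, not the df above):

```r
# Toy data (hypothetical): the only "X" is in row 3 of column D.
toy <- data.frame(A = c("1.1", "1.2", "1.3", "1.4"),
                  D = c("4.1", "4.2", "X", "4.4"),
                  stringsAsFactors = FALSE)

# grepl() sees two strings, one per column, so it returns two values.
hits <- grepl("X", toy[, c("A", "D")])
length(hits)

# !hits is c(TRUE, FALSE); it is recycled over the four rows, so rows 1
# and 3 are kept and rows 2 and 4 are dropped, whatever they contain.
kept <- toy[!hits, ]
rownames(kept)
```

Here the row holding "X" survives while two clean rows are removed, which matches the behaviour described in the question.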

We could loop through the columns of interest, check whether the values are equal to "X" (based on the data, it is an exact match), then Reduce the list of logical vectors to a single vector with | and use that to subset the data:

df[!Reduce(`|`, lapply(df[c("A", "B", "C", "D")], `==`, "X")),]

or with grepl (if it is not an exact match)

df[!Reduce(`|`, lapply(df[c("A", "B", "C", "D")], grepl, pattern = "X")),]
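Since the question mentions many data frames of the same format, the Reduce/lapply idea above could be wrapped in a small function and applied over a list. This is only a sketch: drop_flagged_rows() and the two sample data frames are made-up names and data, and it swaps `==` for `%in%` so that NA cells are treated as non-matches instead of producing NA rows.

```r
# drop_flagged_rows() is a hypothetical helper name, not part of base R.
drop_flagged_rows <- function(data, cols = c("A", "B", "C", "D"), flag = "X") {
  # One logical vector per column, TRUE where the cell equals `flag`.
  # %in% is used instead of == so that NA cells count as "no match".
  per_column <- lapply(data[cols], `%in%`, flag)
  # Collapse to one logical per row: TRUE if any column matched.
  hit <- Reduce(`|`, per_column)
  data[!hit, , drop = FALSE]
}

# Two made-up data frames sharing the same layout.
dfs <- list(
  first  = data.frame(fruit = c("a", "b", "c"),
                      A = c("1.0", "X", "1.2"), B = c("2.0", "2.1", "2.2"),
                      C = c("3.0", "3.1", NA),  D = c("4.0", "4.1", "4.2"),
                      stringsAsFactors = FALSE),
  second = data.frame(fruit = c("X", "e"),
                      A = c("1.0", "1.1"), B = c("2.0", "2.1"),
                      C = c("3.0", "3.1"), D = c("4.0", "X"),
                      stringsAsFactors = FALSE)
)

# Apply the same filter to every data frame in the list.
cleaned <- lapply(dfs, drop_flagged_rows)
```

Note that an "X" in the fruit column is left alone, because only the columns named in cols are checked.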

or use tidyverse

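library(tidyverse)
# all_vars(): keep a row only if none of A:D contains 'X'
# (any_vars() would keep any row with at least one column free of 'X')
df %>% 
   filter_at(vars(A:D), all_vars(!grepl('X', .)))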

Depending on the structure of your data:

df[!grepl('X',do.call(paste,df[4:7])),]

should work.

If you have other values that merely contain an X, such as 23X.4, and you want to keep those rows, you can anchor the regex as shown below:

df[!grepl('(?m)^X$',do.call(paste,c(sep='\n',df[4:7])),perl = TRUE),]

cols <- c("A", "B", "C", "D")
df[! rowSums(df[cols] == "X"), ]

This will remove rows from df where the value in any of cols is exactly "X" (rather than merely containing "X", as some other answers do).
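The difference between "equals" and "contains" only shows up when some cell holds an X inside a longer value. A small sketch on made-up data (toy and its values are hypothetical) comparing the two tests:

```r
# Toy data (hypothetical): "23X.4" contains an X, the last D cell is exactly "X".
toy <- data.frame(A = c("10.5", "23X.4", "10.7"),
                  D = c("4.1", "4.2", "X"),
                  stringsAsFactors = FALSE)
cols <- c("A", "D")

# Exact match: TRUE only for the row whose cell is exactly "X" (row 3).
exact <- rowSums(toy[cols] == "X") > 0

# Substring match: also TRUE for the row holding "23X.4" (rows 2 and 3).
partial <- Reduce(`|`, lapply(toy[cols], grepl, pattern = "X", fixed = TRUE))

toy[!exact, ]    # keeps rows 1 and 2
toy[!partial, ]  # keeps row 1 only
```

Which of the two is right depends on whether "X" is a standalone marker, as it appears to be in the posted data, or can be part of a real value.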

Using dplyr to remove any rows where the value in any of the columns A, B, C or D is equal to 'X' looks like this:

library(dplyr)
# all_vars(): keep a row only if every one of A:D differs from 'X'
filter_at(df, vars(A:D), all_vars(. != 'X'))
