Error message when using dplyr to filter with more than 3 levels of a factor

Question

I'm trying to filter on some factor levels in dplyr, but instead of manually writing out the ones I want, like c("Blue","Green","White") etc., I figured something like

levels(df$factor.variable)[1:3]

might prove faster. But if I try to select more than 2 levels using the following code, I get the message "longer object length is not a multiple of shorter object length" and a big chunk of the data doesn't come through. With my dummy data below, 2/3 of the expected rows disappear.

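library(dplyr)

a <- 1:20
b <- rep(c("Blue", "Green", "White", "Grey"), 5)
# stringsAsFactors = TRUE makes colours a factor on R >= 4.0 as well
df <- data.frame(Numbers = a, colours = b, stringsAsFactors = TRUE)
df %>% 
  select(Numbers, colours) %>% 
  filter(colours == levels(df$colours)[1:3])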

Note that if you only select 1 or 2 of the levels above (as in [1] or [1:2], not [1:3]), then the problem doesn't occur. Also, if I remove one of the colours (levels), I don't have the problem anymore.

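a <- 1:15
b <- rep(c("Blue", "Green", "White"), 5)
# stringsAsFactors = TRUE makes colours a factor on R >= 4.0 as well
df <- data.frame(Numbers = a, colours = b, stringsAsFactors = TRUE)
df %>% 
  select(Numbers, colours) %>% 
  filter(colours == levels(df$colours)[1:3])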

What objects have longer/shorter lengths? And why does 2/3 of the data disappear?

Answer 1

You made a mistake in your dplyr code. Using %in% instead of == solves the problem.

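library(dplyr)

a <- 1:20
b <- rep(c("Blue", "Green", "White", "Grey"), 5)
# stringsAsFactors = TRUE makes colours a factor on R >= 4.0 as well
df <- data.frame(Numbers = a, colours = b, stringsAsFactors = TRUE)
str(df)

# %in% tests membership in the set of levels instead of comparing by position
df2 <- df %>% 
  select(Numbers, colours) %>% 
  filter(colours %in% levels(df$colours)[1:3])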

Answer 2

It's actually not a dplyr issue.

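As others mentioned, a == b checks whether each pair of elements is identical, i.e. a[1] == b[1], a[2] == b[2], and so on. (Take a look at ?Comparison.) You're comparing vectors of unequal lengths: the colours column has 20 elements and levels(df$colours)[1:3] has 3. R recycles the shorter vector to match the longer one, and because 20 is not a multiple of 3 the recycling doesn't fit evenly, which is the reason for the warning you got. Only the rows where the colour happens to equal the recycled value at the same position are kept, which is why most of the data disappears.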
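To make the recycling concrete, here is a small base R sketch using the same dummy colours as the question. The names wanted, recycled and by_position are made up for illustration, and plain character vectors are used so the result does not depend on the stringsAsFactors default.

```r
colours <- rep(c("Blue", "Green", "White", "Grey"), 5)  # length 20
wanted  <- c("Blue", "Green", "Grey")                   # length 3

# What == effectively compares against: `wanted` repeated out to length 20
recycled <- rep_len(wanted, length(colours))

# Position-by-position comparison; suppressWarnings() hides the
# "longer object length is not a multiple of shorter object length" warning
by_position <- suppressWarnings(colours == wanted)

identical(by_position, colours == recycled)
#> [1] TRUE
which(by_position)        # positions where the two vectors happen to line up
#> [1]  1  2 12 13 14
sum(by_position)          # rows kept by the == filter
#> [1] 5
sum(colours %in% wanted)  # rows kept by the %in% filter
#> [1] 15
```

Five rows survive instead of fifteen, which matches the 2/3 loss described in the question.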

Instead, a %in% b checks whether each element in a exists somewhere in b, and returns TRUE or FALSE for each element in a.
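As a rough sketch of what that means, %in% can be thought of as a lookup with match(): ?match documents it as match(x, table, nomatch = 0L) > 0L. The vectors x and table below are illustrative only.

```r
x     <- c("Blue", "Green", "White", "Grey")
table <- c("Blue", "Green", "Grey")

# match() gives the position of each element of x in table (NA if absent)
match(x, table)
#> [1]  1  2 NA  3

# %in% only reports whether a position was found; the result always has
# the length of the left-hand side, so no recycling is involved
x %in% table
#> [1]  TRUE  TRUE FALSE  TRUE
identical(x %in% table, match(x, table, nomatch = 0L) > 0L)
#> [1] TRUE
```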

To illustrate with your data:

library(dplyr)

a <- 1:20
b <- rep(c("Blue", "Green", "White", "Grey"), 5)
# stringsAsFactors = TRUE makes colours a factor on R >= 4.0 as well
df <- data.frame(Numbers = a, colours = b, stringsAsFactors = TRUE)

In the a %in% b representation, this is your b:

levels(df$colours)[1:3]
#> [1] "Blue"  "Green" "Grey"

Checking for each element of colours being in that set of values yields a logical vector:

df$colours %in% levels(df$colours)[1:3]
#>  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
#> [12]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE

The base R equivalent of dplyr::filter looks like this, keeping the elements of df$colours for which the previous operation yields TRUE:

df$colours[df$colours %in% levels(df$colours)[1:3]]
#>  [1] Blue  Green Grey  Blue  Green Grey  Blue  Green Grey  Blue  Green
#> [12] Grey  Blue  Green Grey 
#> Levels: Blue Green Grey White

In dplyr, non-standard evaluation drops the need for df$, but you're doing essentially the same thing within dplyr::filter: finding whether each element of colours is in the subset of values levels(colours)[1:3], and then filtering for only those rows corresponding to a TRUE.

df %>%
  filter(colours %in% levels(colours)[1:3])
#>    Numbers colours
#> 1        1    Blue
#> 2        2   Green
#> 3        4    Grey
#> 4        5    Blue
#> 5        6   Green
#> 6        8    Grey
#> 7        9    Blue
#> 8       10   Green
#> 9       12    Grey
#> 10      13    Blue
#> 11      14   Green
#> 12      16    Grey
#> 13      17    Blue
#> 14      18   Green
#> 15      20    Grey
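As for the three-colour example in the question, it appears to work with == only by coincidence: 15 is a multiple of 3, so there is no warning, and the recycled levels happen to be in the same order as the data. A minimal base R sketch (colours3 and wanted3 are illustrative names) shows how fragile that is:

```r
colours3 <- rep(c("Blue", "Green", "White"), 5)  # length 15
wanted3  <- c("Blue", "Green", "White")

# Recycling fits evenly and the order lines up, so every position matches
all(colours3 == wanted3)
#> [1] TRUE

# Same set of values in a different order: still no warning, but only the
# "Green" positions line up, so == silently drops two thirds of the rows
sum(colours3 == rev(wanted3))
#> [1] 5

# %in% does not depend on order or length
sum(colours3 %in% rev(wanted3))
#> [1] 15
```

So even when == gives no warning, %in% is the safer choice for filtering on a set of values.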
