r - retain rows when certain condition is met in column

Question

df <- data.frame(a = c(1:5), b=c("df_1_1","df_1_2","df_2_3","df_1_1","df_2_4"))

I would like to retain rows only when the last four characters of the string in column b are "_1_1". So in this case only rows 1 and 4 will be retained.

Thanks

Answer 1

We can use grepl to match the partial string, i.e. _1_1 at the end ( $ ) of the string, and subset those rows in base R:

subset(df, grepl('_1_1$', b))
#  a      b
#1 1 df_1_1
#4 4 df_1_1
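The `$` anchor is what restricts the match to the end of the string. As a small sketch, using a hypothetical vector `x` that is not part of the question, the difference between the anchored and unanchored pattern looks like this:

```r
# Hypothetical values: "_1_1" appears at the end of the first string
# and in the middle of the second one.
x <- c("df_1_1", "df_1_1_x", "df_1_2")

unanchored <- grepl("_1_1", x)   # matches "_1_1" anywhere in the string
anchored   <- grepl("_1_1$", x)  # matches "_1_1" only at the end

unanchored  # TRUE  TRUE FALSE
anchored    # TRUE FALSE FALSE
```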

Answer 2

Another base R option using subset + endsWith

> subset(df,endsWith(b,"_1_1"))
  a      b
1 1 df_1_1
4 4 df_1_1

Answer 3

First of all, subset should be used only interactively. In code, use standard subsetting instead: df$b to extract the column and df[match_row, ] to select the rows.
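The reason is that subset evaluates its condition non-standardly, which its help page warns can have unanticipated consequences in programs. A minimal sketch of the standard form, wrapped in a function; `keep_suffix` and its arguments are hypothetical names chosen for illustration, not part of base R:

```r
df <- data.frame(a = 1:5,
                 b = c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4"))

# keep_suffix() is a hypothetical helper. The column is looked up by name
# with [[ and the rows are selected with [, so no non-standard evaluation
# is involved and the function behaves the same wherever it is called.
keep_suffix <- function(data, column, suffix) {
  match_row <- grepl(paste0(suffix, "$"), data[[column]])
  data[match_row, , drop = FALSE]
}

kept <- keep_suffix(df, "b", "_1_1")
kept
```

Here the suffix is pasted into a regular expression, so it is assumed to contain no regex metacharacters.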

Alternatives:

Typically, when you want to search for a string in base R, you use base::grep. If you do not need a regular expression, you can pass fixed=TRUE for a somewhat faster function.
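For reference, a short sketch of how these calls differ in what they return, and of what fixed=TRUE does to the pattern. The vector `b` mirrors column b from the question; the other variable names are only illustrative:

```r
b <- c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4")

idx   <- grep("_1_1$", b)                # integer positions of the matches
flags <- grepl("_1_1$", b)               # logical vector, one entry per element
vals  <- grep("_1_1$", b, value = TRUE)  # the matching strings themselves

# With fixed = TRUE the pattern is taken literally: "$" is no longer an
# anchor, so an end-of-string match cannot be expressed this way.
literal <- grep("_1_1", b, fixed = TRUE)   # "_1_1" anywhere in the string
no_hit  <- grep("_1_1$", b, fixed = TRUE)  # searches for a literal "$"

idx      # 1 4
literal  # 1 4
no_hit   # integer(0)
```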

Here I will compare @akrun's grepl, plain grep, @ThomasIsCoding's endsWith (nice find!) and a solution from the stringr library.

Benchmarking:

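# Example data from the question
df <- data.frame(a = c(1:5), b=c("df_1_1","df_1_2","df_2_3","df_1_1","df_2_4"))

library("stringr")
library("microbenchmark")

# Each expression is evaluated 100 times (the microbenchmark default)
microbenchmark(
    grepl("_1_1$", df$b),            # regex, anchored at the end
    grep("_1_1", df$b),              # regex, unanchored
    grep("_1_1", df$b, fixed=TRUE),  # literal match, unanchored
    endsWith(df$b, "_1_1"),          # base R suffix test
    stringr::str_ends(df$b, "_1_1")  # stringr equivalent
    )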

Results:

Unit: microseconds
                             expr    min      lq     mean  median      uq     max neval
             grepl("_1_1$", df$b)  8.903 10.5985 11.68160 11.4215 12.2405  30.818   100
               grep("_1_1", df$b)  9.020 10.0210 10.87328 10.8610 11.3625  15.101   100
 grep("_1_1", df$b, fixed = TRUE)  3.709  4.7350  5.22625  5.2385  5.7445   7.059   100
           endsWith(df$b, "_1_1")  2.049  2.7510  3.28577  3.1055  3.4460  23.906   100
           str_ends(df$b, "_1_1") 35.657 38.2500 41.69787 40.4805 41.9560 131.970   100

Summary:

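grep and grepl perform almost identically. Personally, I find grep a little more useful due to its flexibility, so I use it most of the time (e.g., with value=TRUE it returns the matched values). Note that the two grep calls in the benchmark omit the $ anchor (which fixed=TRUE could not use anyway), so they match _1_1 anywhere in the string; for this data the result is the same.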

While the stringr and stringi libraries are rightfully praised, str_ends is surprisingly slow compared to all the built-in solutions. When you need flexibility and find yourself constructing complex string matching, stringr might be the right choice, but base solutions are fine for most use cases and, in this case, much faster.

The built-in endsWith beats everything. It is a specialized and highly optimized solution implemented in pure C.

A correct call (without subset ) is then:

df[endsWith(df$b, "_1_1"), ]
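Two caveats apply to this call: endsWith only accepts character vectors, and it returns NA for missing values, which would produce an all-NA row when used directly as a row index. A hedged sketch of one way to handle both, using a hypothetical data frame `df2` that is not part of the question:

```r
# Hypothetical data: column b is stored as a factor and has a missing value
df2 <- data.frame(a = 1:4,
                  b = c("df_1_1", NA, "df_2_3", "df_1_1"),
                  stringsAsFactors = TRUE)

# as.character() converts the factor so that endsWith() accepts it;
# which() keeps only the positions that are TRUE and drops the NA.
keep <- which(endsWith(as.character(df2$b), "_1_1"))
df2[keep, ]
```

With a plain character column and no missing values, as in the question, the shorter call above is sufficient.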

3 answers

solution1
2 2021-02-10 22:43:30

solution2
1 2021-02-10 22:45:04

solution3
1 ACCEPTED 2021-02-10 23:09:06
