
r - retain rows when certain condition is met in column

df <- data.frame(a = c(1:5), b=c("df_1_1","df_1_2","df_2_3","df_1_1","df_2_4"))

I would like to retain rows only when the last 4 characters in column b are "_1_1". In this case only rows 1 and 4 would be retained.

Thanks

We can use grepl to match a partial string, i.e. _1_1 at the end ($) of the string, and subset those rows in base R:

subset(df, grepl('_1_1$', b))
#  a      b
#1 1 df_1_1
#4 4 df_1_1
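
To see why this works, it may help to look at the logical vector that grepl produces before subset uses it. This is a minimal sketch on the question's data; stringsAsFactors = FALSE is set explicitly here as an assumption, so that b is a character vector on R versions before 4.0 as well.

```r
df <- data.frame(a = 1:5,
                 b = c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4"),
                 stringsAsFactors = FALSE)

# grepl() returns one TRUE/FALSE per element of b
keep <- grepl("_1_1$", df$b)
keep
# [1]  TRUE FALSE FALSE  TRUE FALSE

# the logical vector selects the rows to retain (rows 1 and 4)
result <- df[keep, ]
```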

Another base R option using subset + endsWith

> subset(df,endsWith(b,"_1_1"))
  a      b
1 1 df_1_1
4 4 df_1_1

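First of all, subset should be used only interactively; its help page warns that its non-standard evaluation can have unanticipated consequences when used for programming. In scripts and functions, use standard subsetting instead: df$b to get the column and df[match_row, ] to select the rows.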

Alternatives:

Typically, when you want to search for a string in base R, you use base::grep. If you do not need regular expressions, you can pass fixed=TRUE, which is somewhat faster.
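
One caveat that may be worth illustrating: with fixed=TRUE the pattern is treated as a literal string, so the $ anchor is no longer available and the match can occur anywhere in the value. The vector x below is made up for illustration and is not from the question.

```r
x <- c("df_1_1", "df_1_1_extra", "df_2_3")

# literal match anywhere in the string: positions 1 and 2
anywhere <- grep("_1_1", x, fixed = TRUE)

# regular expression anchored at the end: position 1 only
at_end <- grep("_1_1$", x)

# with fixed = TRUE, "$" is an ordinary character, so nothing matches
literal_dollar <- grep("_1_1$", x, fixed = TRUE)
```

For the data in the question the anchored and unanchored patterns select the same rows, because "_1_1" only ever appears at the end of a value.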

Here I will compare @akrun's grepl, plain grep (with and without fixed=TRUE), @ThomasIsCoding's endsWith (nice find!) and a solution from the stringr library.

Benchmarking:

df <- data.frame(a = c(1:5), b=c("df_1_1","df_1_2","df_2_3","df_1_1","df_2_4"))
# the suffix "_1_1" is spelled out literally in each call below

library("stringr")
library("microbenchmark")

microbenchmark(
    grepl("_1_1$", df$b),
    grep("_1_1", df$b),
    grep("_1_1", df$b, fixed=TRUE),
    endsWith(df$b, "_1_1"),
    stringr::str_ends(df$b, "_1_1")
    )

Results:

Unit: microseconds
                             expr    min      lq     mean  median      uq     max neval
             grepl("_1_1$", df$b)  8.903 10.5985 11.68160 11.4215 12.2405  30.818   100
               grep("_1_1", df$b)  9.020 10.0210 10.87328 10.8610 11.3625  15.101   100
 grep("_1_1", df$b, fixed = TRUE)  3.709  4.7350  5.22625  5.2385  5.7445   7.059   100
           endsWith(df$b, "_1_1")  2.049  2.7510  3.28577  3.1055  3.4460  23.906   100
           str_ends(df$b, "_1_1") 35.657 38.2500 41.69787 40.4805 41.9560 131.970   100

Summary:

grep and grepl are almost identical in speed. Personally, I find grep a little more useful because of its flexibility, so I use it most of the time (e.g., with value=TRUE it returns the matched values instead of their positions).
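
A short sketch of what each variant returns, using the values of column b from the question. The "no_such_suffix" pattern is invented here only to show one pitfall of index-based removal.

```r
b <- c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4")

idx  <- grep("_1_1$", b)                # integer positions: 1 4
vals <- grep("_1_1$", b, value = TRUE)  # matched strings: "df_1_1" "df_1_1"
mask <- grepl("_1_1$", b)               # logical: TRUE FALSE FALSE TRUE FALSE

# Pitfall when dropping elements by index: if nothing matches, grep()
# returns integer(0), and b[-integer(0)] is empty, not the full vector.
none <- grep("no_such_suffix$", b)
dropped_by_index <- b[-none]                         # length 0
dropped_by_mask  <- b[!grepl("no_such_suffix$", b)]  # all 5 elements kept
```

So grep is handy for positions or values, while the logical result of grepl is the safer choice when negating a match.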

While the stringr and stringi packages are rightfully praised, str_ends is surprisingly slow compared to all the built-in solutions. When you need flexibility and find yourself constructing complex string matching, stringr might be the right choice, but base solutions are fine for most use cases and, in this case, much faster.

The built-in endsWith beats everything. It is a specialized, highly optimized function implemented in C.

A correct call (without subset) is then:

df[endsWith(df$b, "_1_1"), ]
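
If the same filter is needed in several places, it could be wrapped in a small function. This is only a sketch: the name keep_suffix_rows and its arguments are hypothetical, and as.character() is added on the assumption that the column might be a factor (the default for data.frame() before R 4.0.0), which endsWith() does not accept.

```r
# Hypothetical helper: keep the rows of `data` whose `column` ends with `suffix`
keep_suffix_rows <- function(data, column, suffix) {
  values <- as.character(data[[column]])  # also handles factor columns
  data[endsWith(values, suffix), , drop = FALSE]
}

# stringsAsFactors = TRUE mimics the pre-4.0 default to exercise the coercion
df <- data.frame(a = 1:5,
                 b = c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4"),
                 stringsAsFactors = TRUE)

kept <- keep_suffix_rows(df, "b", "_1_1")
```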
