df <- data.frame(a = c(1:5), b=c("df_1_1","df_1_2","df_2_3","df_1_1","df_2_4"))
I would like to retain rows only when the 4 rightmost characters in column b are "_1_1". So in this case only rows 1 and 4 will be retained.
Thanks
We can use grepl to match the partial string _1_1 at the end ($) of the string and subset those rows in base R:
subset(df, grepl('_1_1$', b))
# a b
#1 1 df_1_1
#4 4 df_1_1
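The $ anchor matters here: without it, the pattern matches _1_1 anywhere in the string. A minimal sketch of the difference, where the vector x and its third value are made up for illustration and are not part of the question's data:

```r
# Hypothetical values; "df_1_1_x" contains "_1_1" but does not end with it.
x <- c("df_1_1", "df_1_2", "df_1_1_x")

anchored   <- grepl("_1_1$", x)  # match only at the end of the string
unanchored <- grepl("_1_1", x)   # match anywhere in the string

print(anchored)
print(unanchored)
```

For the data in the question both patterns select the same rows, because no value has _1_1 in the middle.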
Another base R option, using subset + endsWith:
> subset(df,endsWith(b,"_1_1"))
a b
1 1 df_1_1
4 4 df_1_1
First of all, subset should be used only interactively. In scripts and functions, use standard subsetting instead: df$b to get the column and df[match_row, ] to select the rows.
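As a sketch of what standard subsetting could look like inside a function, the example below wraps the row selection in a small helper. The name keep_suffix_rows and its arguments are hypothetical, chosen only for illustration; stringsAsFactors = FALSE is set so that b is a character vector on R versions before 4.0 as well.

```r
df <- data.frame(a = 1:5,
                 b = c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4"),
                 stringsAsFactors = FALSE)

# keep_suffix_rows is a hypothetical helper, not part of base R.
keep_suffix_rows <- function(data, column, suffix) {
  match_row <- endsWith(data[[column]], suffix)  # logical vector, one per row
  data[match_row, , drop = FALSE]                # standard row subsetting
}

result <- keep_suffix_rows(df, "b", "_1_1")
print(result)
```

Because the column is passed as a string and looked up with [[, the helper does not rely on the non-standard evaluation that subset uses.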
Typically, when you want to search for a string in base R, you use base::grep. If you do not need a regular expression, you can pass fixed=TRUE for a somewhat faster literal match.
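Besides speed, fixed=TRUE changes what counts as a match, since the pattern is then taken literally. A small sketch with made-up strings (x is not from the question):

```r
# Hypothetical strings: "." is a regex metacharacter that matches any character.
x <- c("a.b", "axb", "a.bc")

regex_hits <- grep("a.b", x)                # "." matches any single character
fixed_hits <- grep("a.b", x, fixed = TRUE)  # "." must be a literal dot

print(regex_hits)
print(fixed_hits)
```

One consequence is that anchors such as $ are also literal under fixed=TRUE, so a fixed search cannot be restricted to the end of the string.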
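Here I compare @akrun's grepl, grep, @ThomasIsCoding's endsWith (nice find!) and a solution from the stringr library. Note that the two grep calls are not anchored, so they match _1_1 anywhere in the string; for this data that selects the same rows.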
df <- data.frame(a = c(1:5), b=c("df_1_1","df_1_2","df_2_3","df_1_1","df_2_4"))
match = "_1_1"
library("stringr")
library("microbenchmark")
microbenchmark(
grepl("_1_1$", df$b),
grep("_1_1", df$b),
grep("_1_1", df$b, fixed=TRUE),
endsWith(df$b, "_1_1"),
stringr::str_ends(df$b, "_1_1")
)
Unit: microseconds
                            expr    min      lq     mean  median      uq     max neval
            grepl("_1_1$", df$b)  8.903 10.5985 11.68160 11.4215 12.2405  30.818   100
              grep("_1_1", df$b)  9.020 10.0210 10.87328 10.8610 11.3625  15.101   100
grep("_1_1", df$b, fixed = TRUE)  3.709  4.7350  5.22625  5.2385  5.7445   7.059   100
          endsWith(df$b, "_1_1")  2.049  2.7510  3.28577  3.1055  3.4460  23.906   100
          str_ends(df$b, "_1_1") 35.657 38.2500 41.69787 40.4805 41.9560 131.970   100
grep and grepl are almost identical in speed. Personally, I find grep a little more useful because of its flexibility, so I use it most of the time (e.g., with value=TRUE it returns the matched values).
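To make the difference in return values concrete, here is a short sketch using the b column from the question as a plain character vector:

```r
b <- c("df_1_1", "df_1_2", "df_2_3", "df_1_1", "df_2_4")

idx    <- grep("_1_1$", b)                # integer positions of the matches
values <- grep("_1_1$", b, value = TRUE)  # the matching strings themselves
flags  <- grepl("_1_1$", b)               # one TRUE/FALSE per element

print(idx)
print(values)
print(flags)
```

One thing to keep in mind when choosing between them: the logical vector from grepl can be negated with ! to drop matches, whereas negative indexing with an empty result from grep selects nothing instead of everything.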
While the stringr and stringi libraries are rightfully praised, str_ends is surprisingly slow compared to all the built-in solutions. When you need flexibility and find yourself constructing complex string matching, stringr might be the right choice, but base solutions are fine for most use cases and, in this case, much faster.
The built-in endsWith beats everything. It is a specialized and highly optimized solution implemented in pure C.
A correct call (without subset) is then:
df[endsWith(df$b, "_1_1"), ]
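One caveat with logical indexing is missing values: endsWith returns NA for an NA input, and an NA in a row index produces a row of NAs. The sketch below uses a made-up data frame df_na (not from the question) and wraps the index in which() as one possible way to drop such rows.

```r
# Hypothetical data with a missing value in b.
# stringsAsFactors = FALSE keeps b as character on R versions before 4.0,
# since endsWith expects character input.
df_na <- data.frame(a = 1:3,
                    b = c("df_1_1", NA, "df_2_3"),
                    stringsAsFactors = FALSE)

keep <- endsWith(df_na$b, "_1_1")    # NA input gives NA output

with_na_row <- df_na[keep, ]         # the NA index yields a row of NAs
clean       <- df_na[which(keep), ]  # which() keeps only the TRUE positions

print(with_na_row)
print(clean)
```

If the column has no missing values, as in the question, the plain logical index is enough.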