Extracting specific elements from a character vector
I have a character vector:
a=c("Mom", "mother", "Alex", "Betty", "Prime Minister")
I want to extract only the words starting with "M" (both upper and lower case).
How can I do this?
I have tried using grep(), sub() and other variants of these functions, but I am not getting it right.
I expect the output to be a character vector of "Mom" and "mother".
a[startsWith(toupper(a), "M")]
Plain grep will also do just fine:
grep( "^m", a, ignore.case = TRUE, value = TRUE )
#[1] "Mom" "mother"
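Benchmarks
tom's answer (startsWith) is the fastest of the original answers, but there is some room for improvement (check startsWith2's code). Note that startsWith2 relies on startsWith() recycling the prefix vector c("M", "m") along a, so it returns the right result for this particular a only because of the order of its elements; it is not a general "starts with M or m" test.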
microbenchmark::microbenchmark(
substr = a[substr(a, 1, 1) %in% c("M", "m")],
grepl = a[grepl("^[Mm]", a)],
grep = grep( "^m", a, ignore.case = TRUE, value = TRUE ),
stringr = unlist(stringr::str_extract_all(a, stringr::regex("^M.*", ignore_case = TRUE))),
startsWith1 = a[startsWith(toupper(a), "M")],
startsWith2= a[startsWith(a, c("M", "m"))]
)
# Unit: nanoseconds
#        expr   min      lq     mean median    uq    max neval
#      substr  1808  2411.0  3323.19   3314  3917   8435   100
#       grepl  3916  4218.0  5438.06   4820  6930   8436   100
#        grep  3615  4368.5  5450.10   4820  6929  19582   100
#     stringr 50913 53023.0 55764.10  54529 55132 174432   100
# startsWith1  1506  2109.0  2814.11   2711  3013  17474   100
# startsWith2   602  1205.0  1410.17   1206  1507   3013   100
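One caveat about the startsWith2 expression, as a small sketch: startsWith() recycles its prefix argument element by element along x instead of testing every string against every prefix. The reordered vector b and the helper starts_with_any below are hypothetical, introduced only for illustration; they are not part of the benchmark above, and the helper's timing was not measured.

```r
a <- c("Mom", "mother", "Alex", "Betty", "Prime Minister")
b <- c("mother", "Mom", "Alex", "Betty")  # hypothetical reordered input

# The prefix is recycled to c("M", "m", "M", "m"), so "mother" is compared
# with "M" and "Mom" with "m": nothing matches.
b[startsWith(b, c("M", "m"))]
# character(0)

# Hypothetical helper: test x against each prefix separately, then OR the results.
starts_with_any <- function(x, prefixes) {
  Reduce(`|`, lapply(prefixes, function(p) startsWith(x, p)))
}

b[starts_with_any(b, c("M", "m"))]
# [1] "mother" "Mom"
a[starts_with_any(a, c("M", "m"))]
# [1] "Mom"    "mother"
```

If only the two cases of a single letter matter, startsWith(a, "M") | startsWith(a, "m") expresses the same thing without a helper.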
Use grepl with the pattern ^[Mm]:
a[grepl("^[Mm]", a)]
[1] "Mom" "mother"
Here is what the pattern ^[Mm] means:
^ from the start of the string
[Mm] match either a lowercase or uppercase letter M
The grepl function only checks whether the pattern matches at least once in each element and returns TRUE or FALSE accordingly, so we don't need to be concerned with the rest of the string.
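To make that concrete, here is a minimal sketch using the same vector a. The variable name hits is only illustrative. It shows the logical vector that grepl() produces, and what would happen without the ^ anchor.

```r
a <- c("Mom", "mother", "Alex", "Betty", "Prime Minister")

# One logical value per element: TRUE where the string starts with M or m
hits <- grepl("^[Mm]", a)
hits
# [1]  TRUE  TRUE FALSE FALSE FALSE

# Logical subsetting keeps only the TRUE positions
a[hits]
# [1] "Mom"    "mother"

# Without the ^ anchor, an M or m anywhere in the string would match as well
a[grepl("[Mm]", a)]
# [1] "Mom"            "mother"         "Prime Minister"
```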
Using stringr:
library(stringr)
unlist(str_extract_all(a, regex("^M.*", ignore_case = TRUE)))
[1] "Mom" "mother"
substr is a very tractable base R function:
a[substr(a, 1, 1) %in% c("M", "m")]
# [1] "Mom" "mother"
And since you mentioned sub(), you could also do this (not necessarily recommended, though):
a[sub("(.).*", "\\1", a) %in% c("M", "m")]
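For anyone wondering why the sub() call works, here is a sketch of the intermediate step. The pattern (.).* captures the first character and then matches the rest of the string, and the replacement \\1 keeps only the captured character. The name first_chars is only illustrative.

```r
a <- c("Mom", "mother", "Alex", "Betty", "Prime Minister")

# Replace each whole string with its first character
first_chars <- sub("(.).*", "\\1", a)
first_chars
# [1] "M" "m" "A" "B" "P"

# Keep the elements whose first character is M or m
a[first_chars %in% c("M", "m")]
# [1] "Mom"    "mother"
```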