
Extracting specific elements from a character vector

I have a character vector:

a=c("Mom", "mother", "Alex", "Betty", "Prime Minister")

I want to extract only the words starting with "M" (both upper and lower case).

How can I do this?

I have tried using grep(), sub() and other variants of these functions, but I am not getting it right.

I expect the output to be a character vector of "Mom" and "mother".

a[startsWith(toupper(a), "M")]

Plain grep will also do just fine:

grep( "^m", a, ignore.case = TRUE, value = TRUE )
#[1] "Mom"    "mother"
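
For comparison, here is a small sketch of what the same call gives without value = TRUE. By default grep() returns the positions of the matching elements rather than the strings themselves, so you index back into the vector yourself. The name idx is just an illustrative variable.

```r
a <- c("Mom", "mother", "Alex", "Betty", "Prime Minister")

# Without value = TRUE, grep() returns the indices of the matches
idx <- grep("^m", a, ignore.case = TRUE)
print(idx)     # 1 2

# Indexing with those positions recovers the matching strings
print(a[idx])  # "Mom" "mother"
```

Either form works; value = TRUE simply saves the extra indexing step.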

Benchmarks
tom's answer (startsWith) is the winner, but there is some room for improvement (check the code for startsWith2):

microbenchmark::microbenchmark(
  substr = a[substr(a, 1, 1) %in% c("M", "m")],
  grepl = a[grepl("^[Mm]", a)],
  grep = grep( "^m", a, ignore.case = TRUE, value = TRUE ),
  stringr = unlist(stringr::str_extract_all(a, stringr::regex("^M.*", ignore_case = TRUE))),
  startsWith1 = a[startsWith(toupper(a), "M")],
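  # Caution: startsWith() recycles `prefix` element-wise against `a`,
  # so this form is only correct for this particular ordering of `a`
  startsWith2 = a[startsWith(a, c("M", "m"))]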
)


# Unit: nanoseconds
#        expr   min      lq     mean median    uq    max neval
#      substr  1808  2411.0  3323.19   3314  3917   8435   100
#       grepl  3916  4218.0  5438.06   4820  6930   8436   100
#        grep  3615  4368.5  5450.10   4820  6929  19582   100
#     stringr 50913 53023.0 55764.10  54529 55132 174432   100
# startsWith1  1506  2109.0  2814.11   2711  3013  17474   100
# startsWith2   602  1205.0  1410.17   1206  1507   3013   100
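
One caveat about startsWith2: startsWith() recycles its prefix argument element-wise against x rather than testing every string against every prefix, so it returns the right answer for this particular a largely by coincidence. Below is a minimal sketch of the pitfall, using a hypothetical reordered vector b that is not part of the question, together with a variant that tests each prefix explicitly. That variant was not benchmarked here.

```r
# Hypothetical vector for illustration: same words as `a`, different order
b <- c("mother", "Mom", "Alex", "Betty", "Prime Minister")

# prefix is recycled: "mother" is tested against "M", "Mom" against "m", ...
recycled <- b[startsWith(b, c("M", "m"))]

# Test each prefix on the whole vector and combine the logical results
explicit <- b[startsWith(b, "M") | startsWith(b, "m")]

print(recycled)  # character(0)
print(explicit)  # "mother" "Mom"
```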

Use grepl, with the pattern ^[Mm]:

a[grepl("^[Mm]", a)]

[1] "Mom"    "mother"

Here is what the pattern ^[Mm] means:

^      from the start of the string
[Mm]   match either a lowercase or uppercase letter M

The grepl function returns TRUE for every element in which the pattern matches at least once, so we don't need to be concerned with the rest of the string.
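
To make that concrete, here is a short sketch that splits the one-liner into two steps. The intermediate name keep is only illustrative.

```r
a <- c("Mom", "mother", "Alex", "Betty", "Prime Minister")

# grepl() returns one TRUE/FALSE per element of `a`
keep <- grepl("^[Mm]", a)
print(keep)     # TRUE TRUE FALSE FALSE FALSE

# Indexing with the logical vector keeps only the TRUE positions
print(a[keep])  # "Mom" "mother"
```

Note that "Prime Minister" is not selected: the ^ anchor means only the first character of the whole string is tested, not the first letter of each word.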

Using stringr:

library(stringr)
unlist(str_extract_all(a, regex("^M.*", ignore_case = TRUE)))

[1] "Mom"    "mother"

substr is a very tractable base R function:

a[substr(a, 1, 1) %in% c("M", "m")]

# [1] "Mom"    "mother"

And since you mentioned sub(), you could also do this (though it is not necessarily recommended):

a[sub("(.).*", "\\1", a) %in% c("M", "m")]
