Remove words that begin with a capital letter

Question

How do I remove all words that begin with a capital letter from a dataset?

For example:

d <- c("nice", "cat", "Cat", "Dog")

should result in c("nice", "cat")

(Yes, I searched online for a long time before asking this question. I'm sure the answer is simple, but I cannot figure out the regex syntax for it.)

Answer 1

d[!substr(d,1,1) %in% LETTERS]

 [1] "nice" "cat"
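To see why this one-liner works, it can help to look at the intermediate values. The following is a minimal sketch using the example vector from the question; the names first_chars, is_capital and kept are only illustrative and are not part of the original answer.

```r
# Example vector from the question
d <- c("nice", "cat", "Cat", "Dog")

# Step 1: take the first character of every element
first_chars <- substr(d, 1, 1)          # "n" "c" "C" "D"

# Step 2: test whether that character is one of the 26 built-in capitals A-Z
is_capital <- first_chars %in% LETTERS  # FALSE FALSE TRUE TRUE

# Step 3: keep only the elements that do NOT start with a capital
kept <- d[!is_capital]
print(kept)                             # "nice" "cat"
```

Note that LETTERS holds only the ASCII capitals A to Z, so a word starting with an accented capital would be kept by this approach.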

I was curious which of the options would be faster. Here are timings using a string vector with 100,000 words and one with 1 million words. I also added an option using the stringi package, as well as the additional options suggested by @Hugh.

library(microbenchmark)
library(stringi)
library(data.table)
library(hutils)

set.seed(3)
d <- replicate(1e5, paste(sample(c(letters,LETTERS),sample(3:15,1)), collapse="")) 

microbenchmark(substr=d[!substr(d,1,1) %in% LETTERS],
               grepl=d[!grepl("^[A-Z]", d)],
               grepl_perl=d[!grepl("^[A-Z]", d, perl = TRUE)],
               grep=grep("^[A-Z]", d, invert = TRUE, value = TRUE),
               stri_detect=d[!stri_detect(d, regex="^[A-Z]")],
               hutils=d[substr(d, 0, 1) %notin% LETTERS],
               data.table=d[!substr(d,1,1) %chin% LETTERS], times=50)

 Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval cld
       substr 19.34844 21.12396 23.66050 22.57122 25.83950 34.19547    50 ab
        grepl 25.07439 27.64983 30.31913 28.46804 31.40705 44.55779    50 d
   grepl_perl 19.90326 21.68584 25.45138 22.87602 25.09937 97.97515    50 bc
         grep 23.65844 26.01204 28.72596 27.35598 29.84097 57.92622    50 cd
  stri_detect 29.16854 30.56955 35.62350 32.13257 39.58317 68.51851    50 e
       hutils 19.08427 20.92759 22.73886 21.80824 23.56090 30.38251    50 ab
   data.table 17.26040 18.80886 21.23428 20.12133 21.63160 46.63104    50 a

set.seed(3)
d <- replicate(1e6, paste(sample(c(letters,LETTERS),sample(3:15,1)), collapse=""))

 Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval cld
       substr 165.1537 179.1681 192.0479 186.7607 194.0660 331.4462    50 ab
        grepl 249.7070 260.9464 273.2971 275.0886 283.0254 302.8629    50 d
   grepl_perl 193.3224 200.4336 213.0202 209.6039 218.6055 362.2129    50 c
         grep 236.9711 252.3330 269.3016 272.6767 279.6774 375.7031    50 d
  stri_detect 264.5809 281.9253 291.0088 289.7235 301.6924 321.7426    50 e
       hutils 169.7349 179.9313 197.5622 190.5230 196.2092 346.2719    50 bc
   data.table 151.4252 160.2967 177.1575 167.4981 175.2483 310.1474    50 a

Answer 2

Find (grep) words that start (^) with an upper-case letter ([A-Z]), but return everything else (invert = TRUE):

grep("^[A-Z]", c("nice", "cat", "Cat", "Dog"), invert = TRUE, value = TRUE)
# [1] "nice" "cat"

Answer 3

You can use grepl() to generate a logical index and use it for subsetting (where ^ denotes the start of the string):

d[!grepl("^[A-Z]", d)]
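As a quick sanity check, the three answers can be compared on the example vector from the question. This is only a sketch: the variable names are illustrative, and the POSIX class [[:upper:]] is shown as a possible alternative to [A-Z] rather than something the original answers propose. R's regex documentation notes that character ranges such as [A-Z] are locale-dependent, which is why the class form may be worth considering.

```r
# Example vector from the question
d <- c("nice", "cat", "Cat", "Dog")

# Answer 1: compare the first character against the built-in LETTERS
by_substr <- d[!substr(d, 1, 1) %in% LETTERS]

# Answer 2: grep with invert = TRUE returns the non-matching values
by_grep <- grep("^[A-Z]", d, invert = TRUE, value = TRUE)

# Answer 3: grepl builds a logical index that is negated for subsetting
by_grepl <- d[!grepl("^[A-Z]", d)]

# Possible alternative (an assumption, not from the answers): POSIX class
by_class <- d[!grepl("^[[:upper:]]", d)]

# All four approaches agree on this ASCII-only input
stopifnot(identical(by_substr, by_grep),
          identical(by_grep, by_grepl),
          identical(by_grepl, by_class))
print(by_grepl)
```

On plain ASCII input like this, the choice between the approaches is mostly a matter of readability and speed.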

3 answers

Answer 1
5 2017-11-21 23:35:09

Answer 2
5 Accepted 2017-11-21 23:35:27

Answer 3
5 2017-11-21 23:35:40
