Remove all words containing punctuation from a string (R)

Question

How can I (in R) remove every word that contains punctuation from a string, keeping only the words that contain no punctuation?

  test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"

  desired <- "I a see works not"

Answer 1

Here is an approach using gsub that seems to work:

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)

[1] "I a see works not"

This approach uses the following regex pattern:

[A-Za-z]*     match zero or more leading letters
[^A-Za-z ]    then match one symbol (any character that is neither a letter nor a space)
\\S*          followed by any number of non-whitespace characters
\\s*          followed by any amount of whitespace

We then replace each match with the empty string, which removes every word that contains one or more symbols.
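
As a minimal sketch, the same pattern can be wrapped in a small function. The name remove_punct_words is hypothetical, and the trimws call is an addition not found in the answer: when the last word is removed, the pattern leaves the space that preceded it. Note also that, because the character class only allows letters, a word containing a digit counts as containing a symbol.

```r
# Hypothetical helper; the regex is the one from the answer above.
remove_punct_words <- function(x) {
  out <- gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", x)
  # Assumption: trailing whitespace left by a removed last word is unwanted.
  trimws(out)
}

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
remove_punct_words(test.string)      # "I a see works not"
remove_punct_words("hello world!")   # "hello"
remove_punct_words("version 2 ok")   # "version ok" (the digit is treated as a symbol)
```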

Answer 2

You can use this regex:

(?<=\\s|^)[a-z0-9]+(?=\\s|$)

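(?<=\\s|^) - positive lookbehind: the match must be preceded by whitespace or the start of the string
[a-z0-9]+ - matches one or more letters and digits (lowercase only as written, so add A-Z or match case-insensitively to keep words such as "I")
(?=\\s|$) - positive lookahead: the match must be followed by whitespace or the end of the string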

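Edit by Tim:

This answer takes a whitelist approach: it identifies all the words the OP does want to keep in the output. We can match them with the regex pattern given above and then join the vector of matches with paste: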

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")

[1] "I a see works not"
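
A short sketch of the same whitelist idea as a reusable function may make two properties easier to see. The name keep_clean_words is hypothetical. Because the lookarounds do not consume the surrounding whitespace, two clean words next to each other are both matched, and when nothing matches, paste on an empty vector returns an empty string.

```r
# Hypothetical helper; the pattern is the one used in the edit above.
keep_clean_words <- function(x) {
  pattern <- "(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)"
  # perl = TRUE is required for lookbehind and lookahead.
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  paste(hits, collapse = " ")
}

keep_clean_words("see works fine!")   # "see works"
keep_clean_words("a!! b??")           # ""
```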

Answer 3

Here are a couple of other approaches. Both use stringr, which also provides the %>% pipe.

First approach:

library(stringr)

str_split(test.string, " ", n=Inf) %>%  # split the line into words
unlist %>% 
.[!str_detect(., "\\W|\r")] %>%         # keep only words without punctuation or \r
paste(.,collapse=" ")                   # collapse the words back into one line

Second approach:

str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>% 
unlist %>% 
trimws() %>% 
paste(., collapse=" ")

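^\\w+ - a run of [a-zA-Z0-9_] characters at the start of the string
\\s\\w+\\s - a word made only of [a-zA-Z0-9_] with whitespace before and after it
\\w+$ - a word made only of [a-zA-Z0-9_] at the end of the string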
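
The second pattern works on the test string, but it may be worth checking on other input: \\s\\w+\\s consumes the spaces on both sides, so the second of two adjacent clean words can be missed, and ^\\w+ also matches the leading letters of a first word that goes on to contain punctuation. A split-and-filter sketch in base R, in the spirit of the first approach, avoids both. The name drop_punct_words is hypothetical, and the assumption here is that words are separated by single spaces and that only letters and digits count as clean.

```r
# Hypothetical helper, base R only (no stringr needed).
drop_punct_words <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]   # split on literal spaces
  keep <- grepl("^[A-Za-z0-9]+$", words)         # whole word must be letters/digits
  paste(words[keep], collapse = " ")
}

drop_punct_words("am:% see works o\\r not")   # "see works not"
```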
