Extract all words and letter clusters from a string, making each word a separate element, using gsub() in R

Question

Say we have:

stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

Expected output:

"Here" "we" "have" "words" "Here" "we" "have" "avwerfaf"

I would like to use gsub(), but other methods are definitely accepted. Thanks, guys!

Answer 1

You can use strsplit:

result <- unlist(strsplit(stringTest, " |\\d"))
result[result != ""]
#> [1] "Here"     "we"       "have"     "words"    "Here"     "we"      
#> [7] "have"     "avwerfaf"

Or, if you prefer a one-liner:

unlist(lapply(strsplit(stringTest, "\\W|\\d"), function(x) x[x != ""]))
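Both snippets filter out empty strings because strsplit() emits an empty element whenever two separators sit next to each other (here, a space followed by a digit). A minimal sketch to inspect the unfiltered pieces; the variable name parts is only illustrative:

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# Split on a single space or a single digit, without any cleanup
parts <- strsplit(stringTest, " |\\d")

# Inspect the raw pieces for each input string; note the "" entries
print(parts)

# Dropping the empty pieces leaves only the words
words <- unlist(parts)
words <- words[words != ""]
print(words)
```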

Answer 2

library(tidyverse)

stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

gsub(" \\d", replacement = "", stringTest) %>%
  str_split(pattern = " ") %>%
  unlist()
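Note that the pattern " \\d" only removes a single digit preceded by a space, so a multi-digit number such as 42 would leave a stray digit behind. A base R sketch of the same gsub() idea that is a little more general, assuming that "word" means a run of letters only (that definition is an assumption, not something stated in the question):

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# Replace every run of non-letter characters with a single space
cleaned <- gsub("[^[:alpha:]]+", " ", stringTest)

# Remove the leading/trailing spaces the substitution may leave behind
cleaned <- trimws(cleaned)

# Split each string on single spaces and flatten into one character vector
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
print(words)
```

With this definition, punctuation is treated as a separator too, which may or may not be what you want for input like "don't".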

Answer 3

This falls under the "another approach" category. What you appear to be doing is tokenizing by words, dropping numbers.

library(tokenizers)

unlist(tokenize_words(stringTest, lowercase = FALSE, strip_numeric = TRUE))

Which gives:

[1] "Here"     "we"       "have"     "words"    "Here"     "we"       "have"     "avwerfaf"

If you are working from a data frame, something like this could be useful.

library(dplyr)
library(tibble)    # provides rowid_to_column()
library(tidytext)

df <- tibble(description = stringTest)

df2 <- df %>% 
  rowid_to_column() %>% 
  unnest_tokens(word, description, to_lower = FALSE, strip_numeric = TRUE)

Which returns a new tibble:

> df2
# A tibble: 8 x 2
  rowid word    
  <int> <chr>   
1     1 Here    
2     1 we      
3     1 have    
4     1 words   
5     2 Here    
6     2 we      
7     2 have    
8     2 avwerfaf
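If you would rather avoid the extra packages, roughly the same table can be built in base R. This is only a sketch: it treats a word as a run of letters, and the name df2_base is made up for illustration.

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# One character vector of letter-only tokens per input string
tokens <- regmatches(stringTest, gregexpr("[[:alpha:]]+", stringTest))

# Repeat each row index once per token found in that row
df2_base <- data.frame(
  rowid = rep(seq_along(tokens), lengths(tokens)),
  word  = unlist(tokens),
  stringsAsFactors = FALSE
)
print(df2_base)
```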

3 answers

Answer 1
2 2020-02-27 20:56:19

Answer 2
0 2020-02-27 20:57:18

Answer 3
0 2020-02-27 21:00:17
