Extracting all words and clusters of letters in a string and then making each word a separate piece of data using gsub() in R

Say we have:

stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

Expected output:

"Here" "we" "have" "words" "Here" "we" "have" "avwerfaf"

I would like to use gsub(), but other methods are definitely accepted. Thanks, guys!

You can use strsplit:

result <- unlist(strsplit(stringTest, " |\\d"))
result[result != ""]
#> [1] "Here"     "we"       "have"     "words"    "Here"     "we"      
#> [7] "have"     "avwerfaf"

Or, if you prefer a one-liner:

unlist(lapply(strsplit(stringTest, "\\W|\\d"), function(x) x[x != ""]))

library(tidyverse)

stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

gsub(" \\d", replacement = "", stringTest) %>%
  str_split(pattern = " ") %>%
  unlist()
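
Note that the pattern " \\d" above removes only a single digit preceded by a space. If the numbers can have several digits or sit next to punctuation, a gsub()-based variant along these lines might be more robust. This is only a sketch in base R; the names cleaned and words are made up for illustration, and it assumes that anything other than a letter should act as a separator:

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# Replace every run of non-letter characters (digits, spaces, punctuation)
# with a single space, then trim leading and trailing whitespace.
cleaned <- trimws(gsub("[^[:alpha:]]+", " ", stringTest))

# Split on the single spaces that remain and flatten to one character vector.
words <- unlist(strsplit(cleaned, " ", fixed = TRUE))
words
```

Because each run of non-letters collapses to one space, no empty strings are left to filter out afterwards.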

This falls under the "another approach" category. What you appear to be doing is tokenizing by words and dropping numbers.

library(tokenizers)

unlist(tokenize_words(stringTest, lowercase = FALSE, strip_numeric = TRUE))

Which gives:

[1] "Here"     "we"       "have"     "words"    "Here"     "we"       "have"     "avwerfaf"
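
If you would rather not add a package, roughly the same tokenization can be sketched in base R by extracting runs of letters instead of splitting. This assumes a "word" is any run of alphabetic characters; the variable name tokens is just for illustration:

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# gregexpr() locates every run of letters in each string; regmatches()
# pulls those matches out as a list with one character vector per input.
tokens <- unlist(regmatches(stringTest, gregexpr("[[:alpha:]]+", stringTest)))
tokens
```

Note that this splits on apostrophes and hyphens too, so "don't" would come out as "don" and "t".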

If you are working with a data frame, something like this could be useful.

library(dplyr)
library(tibble)    # provides rowid_to_column()
library(tidytext)

df <- tibble(description = stringTest)

df2 <- df %>% 
  rowid_to_column() %>% 
  unnest_tokens(word, description, to_lower = FALSE, strip_numeric = TRUE)

Which returns a new tibble:

> df2
# A tibble: 8 x 2
  rowid word    
  <int> <chr>   
1     1 Here    
2     1 we      
3     1 have    
4     1 words   
5     2 Here    
6     2 we      
7     2 have    
8     2 avwerfaf
