Extracting all words and clusters of letters in a string and then making each word a separate piece of data using gsub() in R
Say we have:
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")
Expected output:
"Here" "we" "have" "words" "Here" "we" "have" "avwerfaf"
I would like to use gsub(), but other methods are definitely accepted. Thanks, guys!
You can use strsplit:
result <- unlist(strsplit(stringTest, " |\\d"))
result[result != ""]
#> [1] "Here" "we" "have" "words" "Here" "we"
#> [7] "have" "avwerfaf"
or if you prefer a one-liner:
unlist(lapply(strsplit(stringTest, "\\W|\\d"), function(x) x[x != ""]))
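Both versions above split the string and then filter out the empty pieces. As a complement, here is a minimal base R sketch that goes the other way and extracts runs of letters directly with gregexpr() and regmatches(), so there are no empty strings to drop. This is a different technique from strsplit(), and the variable names matches and words are only illustrative.

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# Match each run of one or more letters; digits and spaces are never
# matched, so no empty strings need to be filtered out afterwards.
matches <- regmatches(stringTest, gregexpr("[[:alpha:]]+", stringTest))

# regmatches() returns one character vector per input string; flatten them.
words <- unlist(matches)
print(words)
```

Extracting rather than splitting also means punctuation is dropped without having to add it to the split pattern.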
library(tidyverse)
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")
# Remove each space-plus-number, then split what remains on single spaces
gsub(" \\d+", replacement = "", stringTest) %>%
  str_split(pattern = " ") %>%
  unlist()
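Since the question asks for gsub() specifically, the same remove-then-split idea can be sketched in base R without loading any packages. This version assumes numbers may have several digits and may appear at the start of a string, which the original example does not show; the intermediate names no_digits and trimmed are illustrative.

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# Step 1: gsub() deletes every run of digits, wherever it occurs.
no_digits <- gsub("[[:digit:]]+", "", stringTest)

# Step 2: trimws() drops leading/trailing blanks left where a number
# sat at the start or end of a string.
trimmed <- trimws(no_digits)

# Step 3: split on one or more whitespace characters, so the doubled
# space left by a removed number does not produce an empty token.
words <- unlist(strsplit(trimmed, "[[:space:]]+"))
print(words)
```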
This falls under the "another approach" category. What you appear to be doing is tokenizing by words, dropping numbers.
library(tokenizers)
unlist(tokenize_words(stringTest, lowercase = FALSE, strip_numeric = TRUE))
Which gives:
[1] "Here" "we" "have" "words" "Here" "we" "have" "avwerfaf"
If you are working with a data frame, something like this could be useful.
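library(dplyr)
library(tibble)    # provides rowid_to_column()
library(tidytext)
df <- tibble(description = stringTest)
# Number the rows first so each word keeps the id of the string it came from
df2 <- df %>%
  rowid_to_column() %>%
  unnest_tokens(word, description, to_lower = FALSE, strip_numeric = TRUE)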
Which returns a new tibble:
> df2
# A tibble: 8 x 2
  rowid word
  <int> <chr>
1     1 Here
2     1 we
3     1 have
4     1 words
5     2 Here
6     2 we
7     2 have
8     2 avwerfaf
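If tidytext is not available, a similar long table can be sketched in base R. This is not what unnest_tokens() does internally; it simply extracts letter runs with regmatches() and repeats each row index once per word found. The name df2_base is hypothetical.

```r
stringTest <- c("Here we have 4 words", "Here we have avwerfaf 4")

# One character vector of letter-only tokens per input string.
tokens <- regmatches(stringTest, gregexpr("[[:alpha:]]+", stringTest))

# Repeat each row index once per token found in that row, then flatten
# the tokens so both columns have the same length.
df2_base <- data.frame(
  rowid = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
print(df2_base)
```

The result is a plain data.frame rather than a tibble, with one row per word and the originating string's index in rowid.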