I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character list, and I want the content of my data (a VCorpus) to consist only of the words in my dictionary.
For example:
#[1] "never give up uouo cbbuk jeez"
would become
#[1] "never give up"
as the words "never", "give", and "up" are all in the custom dictionary. I have previously tried the following:
#Wrapping the custom dictionary lookup in a function
english.words <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
DF2 <- DF1[(english.words(DF1$Text)),]
but my result is a character list with one word. Any advice?
Since you are using a data frame, you could try this:
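library(tidyverse)
library(tidytext)
dat <- tibble(text = "never give up uouo cbbuk jeez")
words_to_keep <- c("never", "give", "up")
keep_function <- function(data, words_to_keep) {
  data %>%
    # one row per word (unnest_tokens also lowercases by default)
    unnest_tokens(word, text) %>%
    # keep only the words found in the dictionary
    filter(word %in% words_to_keep) %>%
    # collapse the remaining words back into one string;
    # with several input rows, add a row id column first, otherwise
    # nest() merges everything into a single row
    nest(text = word) %>%
    mutate(text = map(text, unlist),
           text = map_chr(text, paste, collapse = " "))
}
keep_function(dat, words_to_keep)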
You can split the sentences into words, keep only words that are part of your dictionary and paste them in one sentence again.
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))
Here I have created a new column called Text1 containing only English words. If you want to replace the original column instead, assign the output to DF1$Text.
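For anyone who wants to try the split, filter, and paste idea without the original data, here is a minimal self-contained sketch in base R. The toy `custom.dictionary` and `DF1`, and the helper name `keep.dictionary.words`, are assumptions made up for illustration; the `tolower()` call is also an assumption that the dictionary is stored in lowercase. It also shows why the attempt in the question does not work: `%in%` compares each whole string against the dictionary, not each word.

```r
# Toy stand-ins for the asker's objects (hypothetical data)
custom.dictionary <- c("never", "give", "up")
DF1 <- data.frame(Text = c("never give up uouo cbbuk jeez", "Never GIVE in"),
                  stringsAsFactors = FALSE)

# The whole sentence is not a dictionary entry, so this is FALSE
whole.string.match <- DF1$Text[1] %in% custom.dictionary

# Keep only the tokens of one string that appear in the dictionary
# (hypothetical helper; matching is case-insensitive by assumption)
keep.dictionary.words <- function(text, dictionary) {
  tokens <- strsplit(text, "\\s+")[[1]]
  paste(tokens[tolower(tokens) %in% dictionary], collapse = " ")
}

# Apply the helper to every row and store the result in a new column
DF1$Text1 <- vapply(DF1$Text, keep.dictionary.words, character(1),
                    dictionary = custom.dictionary, USE.NAMES = FALSE)
```

This sketch splits on whitespace only, so a token such as "up," with trailing punctuation would not match; stripping punctuation first may be needed depending on the data.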