
Is there an R function to clean via a custom dictionary

I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character vector, and I am trying to make the content of my data (a VCorpus) comprise only the words in my dictionary.
For example:

#[1] "never give up uouo cbbuk jeez"  

would become

#[1*] "never give up"  

as the words "never","give",and "up" are all in the custom dictionary. I have previously tried the following:

#Define a helper that tests whether words appear in the custom dictionary
    english.words <- function(x) x %in% custom.dictionary
#Filter rows based on words in the dictionary
    DF2 <- DF1[english.words(DF1$Text), ]

but my result is a character vector containing only one word. Any advice?

Since you use a data frame, you could try this:

library(tidyverse)
library(tidytext)

dat <- tibble(text = "never give up uouo cbbuk jeez")
words_to_keep <- c("never", "give", "up")

keep_function <- function(data, words_to_keep){
  data %>%
    unnest_tokens(word, text) %>%        # split the text column into one word per row
    filter(word %in% words_to_keep) %>%  # keep only words that are in the dictionary
    nest(text = word) %>%                # collect the remaining words into a list column
    mutate(text = map(text, unlist),
           text = map_chr(text, paste, collapse = " "))  # paste them back into one string
}

keep_function(dat, words_to_keep)
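On the example data this should return a one-row tibble whose text column has collapsed back to "never give up".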

You can split the sentences into words, keep only the words that are part of your dictionary, and paste them back into one sentence.

# For each row: split Text on whitespace, keep only dictionary words, then re-join with spaces
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x) 
                    paste0(Filter(english.words, x), collapse = ' '))

Here I have created a new column called Text1 containing only the English words; if you want to replace the original column, you can save the output in DF1$Text instead.
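If the data is still stored as a VCorpus (as mentioned in the question), a minimal sketch of the same idea using the tm package could look like the following; corpus and custom.dictionary are assumed names for your corpus object and dictionary vector.

library(tm)

# Keep only the words that appear in the custom dictionary (assumed to be a character vector)
keep_dictionary_words <- function(text) {
  words <- unlist(strsplit(text, '\\s+'))
  paste(words[words %in% custom.dictionary], collapse = ' ')
}

# content_transformer() lets tm_map() apply a plain character-to-character function to each document
corpus <- tm_map(corpus, content_transformer(keep_dictionary_words))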
