
Is there an R function to clean via a custom dictionary

I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character list, and I want the content of my data (a VCorpus) to consist only of words that appear in my dictionary.
For example:

#[1] "never give up uouo cbbuk jeez"  

would become

#[1] "never give up"  

as the words "never", "give", and "up" are all in the custom dictionary. I have previously tried the following:

#Reading the custom dictionary as a function
    english.words  <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
    DF2 <- DF1[(english.words(DF1$Text)),]

but my result is a character list containing a single word. Any advice?

Since you use a data frame, you could try this:

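library(tidyverse)
library(tidytext)

# Example data and a small stand-in for the custom dictionary
dat <- tibble(text = "never give up uouo cbbuk jeez")
words_to_keep <- c("never", "give", "up")

keep_function <- function(data, words_to_keep) {
  data %>%
    # One row per word (unnest_tokens lowercases by default)
    unnest_tokens(word, text) %>%
    # Keep only the words found in the dictionary
    filter(word %in% words_to_keep) %>%
    # Collect the surviving words and paste them back into one string
    nest(text = word) %>%
    mutate(text = map(text, unlist),
           text = map_chr(text, paste, collapse = " "))
}

keep_function(dat, words_to_keep)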
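
One caveat with the function above: because `nest()` has no other column to group by, a data frame with several rows of text would be collapsed into a single string. A minimal sketch of a row-preserving variant follows, assuming a helper column named `row_id` and a function name `keep_by_row` (both are made up for this illustration, as is the second example sentence):

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row example; words_to_keep stands in for the real dictionary
dat <- tibble(text = c("never give up uouo cbbuk jeez", "jeez give up"))
words_to_keep <- c("never", "give", "up")

keep_by_row <- function(data, words_to_keep) {
  data %>%
    # Remember which row each word came from
    mutate(row_id = row_number()) %>%
    unnest_tokens(word, text) %>%
    filter(word %in% words_to_keep) %>%
    # Rebuild one string per original row
    group_by(row_id) %>%
    summarise(text = paste(word, collapse = " "), .groups = "drop")
}

result <- keep_by_row(dat, words_to_keep)
result
```

Rows in which no word matches the dictionary disappear from the result, so a join back on `row_id` would be needed if empty rows have to be kept.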

You can split the sentences into words, keep only the words that are in your dictionary, and paste them back together into one sentence.

DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x) 
                    paste0(Filter(english.words, x), collapse = ' '))

Here I have created a new column called Text1 that contains only English words. If you want to replace the original column instead, save the output in DF1$Text.
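
For a self-contained version of the same idea, here is a rough sketch in base R. The sample data, the three-word dictionary, and the function name `keep_dictionary_words` are assumptions made for illustration. The lowercasing and punctuation stripping before the lookup are also additions that the original answer does not perform, so drop them if exact matching is wanted.

```r
# Hypothetical stand-ins for the real 400,000-word dictionary and data
custom.dictionary <- c("never", "give", "up")
DF1 <- data.frame(Text = c("never give up uouo cbbuk jeez", "Never GIVE up!"),
                  stringsAsFactors = FALSE)

keep_dictionary_words <- function(text, dictionary) {
  # Accept the dictionary as either a character vector or a list
  dictionary <- unlist(dictionary)
  # Split each string on whitespace: one character vector of words per row
  tokens <- strsplit(text, "\\s+")
  vapply(tokens, function(words) {
    # Normalise a copy of each word for the lookup (assumption: the
    # dictionary is lowercase and contains no punctuation)
    key <- gsub("[[:punct:]]+", "", tolower(words))
    # Keep the original words whose normalised form is in the dictionary
    paste(words[key %in% dictionary], collapse = " ")
  }, character(1))
}

DF1$Text1 <- keep_dictionary_words(DF1$Text, custom.dictionary)
DF1$Text1
```

Because the function is vectorised over a character vector, it could probably also be applied to a VCorpus by wrapping it in `tm::content_transformer()` and passing that to `tm::tm_map()`, though this has not been tried here.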
