
Remove words per year in a corpus

I am working with a corpus of speeches spanning several years (aggregated to the person-year level). I want to remove words that occur fewer than 4 times in a given year (not remove them from the whole corpus, but only from the year in which they do not meet the threshold).

I have tried the following:

DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)

and 

DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)

and 

DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)

and

DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)

However, none of these seem to work. mgsub just replaces each whole 1998 speech with "", while the other options throw error messages. removeWords fails because my words_remove_1998 vector is too large. I have tried splitting the word vector and looping over the chunks (see code below), but R does not seem to like this either (it runs forever).

group <- 100
n <- length(words_remove_1998)
r <- rep(1:ceiling(n/group),each=group)[1:n]
d <- split(words_remove_1998,r)

for (i in 1:length(d)) {
  DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, c(paste(d[[i]]))), DT$text)
}
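
One possible workaround for both problems (the oversized regex inside removeWords and the slow ifelse over the full corpus) is to subset the 1998 rows once and remove the words in chunks. This is only a minimal sketch, assuming DT is a data frame with text and session columns, and the chunk size of 500 is an arbitrary choice meant to keep each removeWords() pattern small:

library(tm)       # removeWords()
library(stringr)  # str_squish()

idx <- grepl("1998", DT$session)

# split the long removal vector so each removeWords() call builds
# a reasonably sized regular expression
chunks <- split(words_remove_1998,
                ceiling(seq_along(words_remove_1998) / 500))

for (ch in chunks) {
  DT$text[idx] <- removeWords(DT$text[idx], ch)
}

DT$text[idx] <- str_squish(DT$text[idx])  # tidy up leftover whitespace

Because only the 1998 rows are touched, the loop runs over far less text than applying removeWords() to the whole column on every iteration.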

Any suggestions for how to solve this?

Thank you for your help!

Reproducible example:

text <- c("i like ice cream", "banana ice cream is my favourite", "ice cream is not my thing")
name <- c("Arnold Ford", "Arnold Ford", "Leslie King")
session <- c("1998", "1999", "1998")

DT <- data.frame(name, session, text)

words_remove_1998 <- c("like", "ice", "cream")

# desired output: words removed only from the 1998 speeches
newtext <- c("i", "banana ice cream is my favourite", "is not my thing")
DT <- cbind(DT, newtext)

My real word vector that I want removed contains 30k elements.
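
For context, a per-year removal vector like words_remove_1998 can be derived from a frequency count. This is only a sketch of one way to build it, assuming DT is a data frame with session and text columns and using the question's cutoff of 4:

library(dplyr)
library(tidyr)

# one word per row, then count each word within each session (year)
word_counts <- DT %>%
  separate_rows(text) %>%
  count(session, text, name = "n")

# words appearing fewer than 4 times in 1998
words_remove_1998 <- word_counts %>%
  filter(session == "1998", n < 4) %>%
  pull(text)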

I ended up not using any of the wrapper functions, as none of them could handle the size of the data. Instead I did it the old-fashioned, simple way: separate the text into one word per row, count the occurrences of each word per session (year) and person, then remove the rows whose count falls below the threshold (the same limit I used to identify the vector of words I wanted to remove). Lastly, I aggregate the data back to its initial level (person-year).

This only works because I am removing words according to a threshold. If I had a list of words to remove that could not be handled this way, I would have been in more trouble.

library(tidyr)   # separate_rows()
library(dplyr)   # group_by(), mutate(), left_join()

# one word per row
DT_separate <- separate_rows(DT, text)

# count how often each word occurs per session (year)
df <- DT_separate %>%
  dplyr::group_by(session, text) %>%
  dplyr::mutate(count = dplyr::n())

# keep only words that meet the threshold
df <- df[df$count > 5, ]

# paste the remaining words back together per person-year
df <- aggregate(
  text ~ x,      # where x is a person-year id
  data = df,
  FUN = paste, collapse = ' '
)

names(df)[names(df) == 'text'] <- 'text2'

# join the cleaned text back onto the original data
DT <- left_join(DT, df, by = "x")

DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]
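
For reference, the same steps can be condensed into a single pipeline. This is a sketch only, assuming the person-year id column is called x (as in the code above) and using a threshold of 4 from the question (the code above uses > 5):

library(dplyr)
library(tidyr)

threshold <- 4  # assumed cutoff; adjust to match the limit actually used

DT_clean <- DT %>%
  separate_rows(text) %>%                 # one word per row
  group_by(session, text) %>%
  mutate(count = n()) %>%                 # frequency of each word per year
  ungroup() %>%
  filter(count >= threshold) %>%          # drop rare words for that year
  group_by(x) %>%                         # x is the person-year id
  summarise(text = paste(text, collapse = " "), .groups = "drop")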
