
Remove words per year in a corpus

I am working with a corpus of speeches spanning several years (aggregated to the person-year level). I want to remove words that occur fewer than 4 times in a year (not remove them from the whole corpus, but only from the year in which they do not meet the threshold).

I have tried the following:

DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)

and 

DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)

and 

DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)

and

DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)

However, none of these seem to work. mgsub just replaces each whole 1998 speech with "", whilst the other options give error messages. The reason removeWords does not work is that my words_remove_1998 vector is too large. I have tried splitting the word vector into chunks and looping over them (see code below), but R does not seem to like this either (it runs forever).

group <- 100                       # chunk size
n <- length(words_remove_1998)
r <- rep(1:ceiling(n / group), each = group)[1:n]
d <- split(words_remove_1998, r)   # split the removal vector into chunks of 100 words

for (i in 1:length(d)) {
  # remove one chunk of words at a time, but only from the 1998 speeches
  DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, d[[i]]), DT$text)
}
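
For reference, a token-level route (split each speech into words, drop the ones that appear in the removal vector, and paste the rest back together) avoids building any large regex at all. A minimal sketch, assuming DT is a data frame with character columns text and session; this is not one of the attempts above:

# Sketch only: token-level removal of words_remove_1998 from the 1998 speeches
is_1998 <- grepl("1998", DT$session)

DT$text[is_1998] <- vapply(
  strsplit(DT$text[is_1998], "\\s+"),   # split each speech into words
  function(tokens) paste(tokens[!tokens %in% words_remove_1998], collapse = " "),
  character(1)                          # return one cleaned string per speech
)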

Any suggestions for how to solve this?

Thank you for your help!

Reproducible example:

text <- c("i like ice cream", "banana ice cream is my favourite", "ice cream is not my thing")
name <- c("Arnold Ford", "Arnold Ford", "Leslie King")
session <- c("1998", "1999", "1998")

DT <- data.frame(name, session, text, stringsAsFactors = FALSE)

words_remove_1998 <- c("like", "ice", "cream")

# desired result for the text column
newtext <- c("i", "banana ice cream is my favourite", "is not my thing")
DT <- cbind(DT, newtext)

My real vector of words to remove contains 30k elements.

I ended up not using any of the wrapper functions, as none of them could handle the size of the data. Instead I did it the old-fashioned and simple way: separate the text into one word per row, count the occurrences of each word per session (year) and person, then remove the rows that fall below a threshold (the same limit I used to identify the vector of words I wanted to remove). Lastly, I aggregate the data back to its initial level (person-year).

This only works because I am removing words according to a frequency threshold. If I had a fixed list of words to remove that could not be handled this way, I would have been in more trouble (the sketch after the code below shows one way such a list could still be handled).

library(tidyr)   # separate_rows
library(dplyr)   # group_by, mutate, left_join

# one word per row
DT_separate <- separate_rows(DT, text)

# count how often each word occurs within each session (year)
df <- DT_separate %>%
  dplyr::group_by(session, text) %>%
  dplyr::mutate(count = dplyr::n())

# keep only words that meet the frequency threshold
# (the same limit used to build the removal vector)
df <- df[df$count > 5, ]

# paste the surviving words back together per person-year
df <- aggregate(
  text ~ x,      # where x is a person-year id
  data = df,
  FUN = paste, collapse = ' '
)

names(df)[names(df) == 'text'] <- 'text2'

# join the cleaned text back onto the original data and swap it in
DT <- left_join(DT, df, by = "x")

DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]
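
For completeness, here is a sketch of how the same separate_rows idea could handle an explicit per-year removal list (rather than a frequency threshold), applied to the reproducible example above. The remove_1998 lookup table, the row_id helper, and the anti_join step are illustrative additions, not part of the code above:

library(dplyr)
library(tidyr)

DT <- data.frame(
  name    = c("Arnold Ford", "Arnold Ford", "Leslie King"),
  session = c("1998", "1999", "1998"),
  text    = c("i like ice cream",
              "banana ice cream is my favourite",
              "ice cream is not my thing"),
  stringsAsFactors = FALSE
)

# (year, word) pairs to drop
remove_1998 <- data.frame(session = "1998",
                          text    = c("like", "ice", "cream"),
                          stringsAsFactors = FALSE)

cleaned <- DT %>%
  mutate(row_id = row_number()) %>%                        # remember the original rows
  separate_rows(text) %>%                                  # one word per row
  anti_join(remove_1998, by = c("session", "text")) %>%    # drop flagged (year, word) pairs
  group_by(row_id, name, session) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  select(-row_id)

# cleaned$text is now c("i", "banana ice cream is my favourite", "is not my thing")

Note that a speech whose every word is removed would drop out of the result entirely, so such rows would need to be re-added with an empty string if that matters.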
