
How to remove meaningless words from corpus?

I am new to R and am trying to remove meaningless words from a corpus. I have a data frame with emails in one column and the target variable in another, and I'm trying to clean the email body data using the tm and qdap packages. I have already gone through most of the related questions and tried the example from "Remove meaningless words from corpus in R". The problem is that when I try to remove unwanted tokens (those that are not dictionary words) from the corpus, I get an error.

library(qdap)
library(tm)

corpus = Corpus(VectorSource(Email$Body))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)

corpus = tm_map(corpus, stemDocument)

tdm = TermDocumentMatrix(corpus)
all_tokens = findFreqTerms(tdm,1)
tokens_to_remove = setdiff(all_tokens, GradyAugmented)
corpus <- tm_map(corpus, content_transformer(removeWords), tokens_to_remove)

Running the last line of the code above produces the following error:

  invalid regular expression '(*UCP)\b(zyx|zyer|zxxxxxâ|zxxxxx|zwischenzeit|zwei|zvolen|zverejneni|zurã|zum|zstepswc|zquez|zprã|zorunlulu|zona|zoho|znis|zmir|zlf|zink|zierk|zhou|zhodnoteni|zgyã|zgã|zfs|zfbeswstat|zerust|zeroâ|zeppelinstr|zellerstrass|zeldir|zel|zdanska|zcfqc|zaventem|zarecka|zarardan|zaragoza|zaobchã|zamã|zakã|zaira|zahradnikova|zagorska|zagã|zachyti|zabih|zã|yusof|yukinobu|yui|ypg|ypaint|youtub|yoursid|youâ|yoshitada|yorkshir|yollayan|yokohama|yoganandam|yiewsley|yhlhjpz|yer|yeovil|yeni|yeatman|yazarina|yazaki|yaz|yasakt|yarm|yara|yannick|yanlislikla|yakar|yaiza|yabortslitem|yã|xxxxx|xxxxgbl|xuezi|xuefeng|xprn|xma|xlsx|xjchvnbbafeg|xiii|xii|xiaonan|xgb|xcede|wythenshaw|wys|wydzial|wydzia|wycomb|www|wuppert|wroclaw|wroc|wrightâ|wpisana|woustvil|wouldnâ|worthwhil|worsley|worri|worldwid|worldâ|workwear|worcestershir|worc|wootton|wooller|woodtec|woodsid|woodmansey|woodley|woodham|woodgat|wonâ|wolverhampton|wjodoyg|wjgfjiq|witti|witt|witkowski|wiss
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'regular expression is too large'
    at ''

Sample document from the email corpus:

[794] "c mailto sent march ne rntbci accountspay nmuk subject new sig plc item still new statement await retriev use link connect account connect account link work copi past follow text address bar top internet browser https od datainterconnect com sigd sigdodsaccount php p thgqmdz d dt s contact credit control contact experi technic problem visit http bau faq datainterconnect com sig make payment call autom credit debit card payment line sig may abl help improv cashflow risk manag retent recoveri contract disput via www sigfinancetool co uk websit provid detail uniqu award win servic care select third parti avail sig custom power" 

tokens_to_remove[1:10]
 [1] "advis"        "appli"        "atlassian"    "bosch"        "boschrexroth" "busi"        
 [7] "communic"     "dcen"         "dcgbsom"      "email" 

I want to remove all words that are meaningless in English, i.e. c, mailto, ne, accountspay, nmuk, etc.

I would do it as follows:

library(quanteda)

mytext <- c("Carles werwa went to sadaf buy trsfr in the supermanket",
            "Marta needs to werwa sadaf go to Jamaica")   # My corpus
tokens_to_remove <- c("werwa", "sadaf", "trsfr")          # My dictionary of unwanted tokens

# Tokenize the text, dropping punctuation and numbers
TokenizedText <- tokens(mytext,
                        remove_punct = TRUE,
                        remove_numbers = TRUE)

# Keep only the tokens that are not in tokens_to_remove
mytextClean <- lapply(TokenizedText, function(x) setdiff(x, tokens_to_remove))

mytextClean
$text1
[1] "Carles"      "went"        "to"          "buy"         "in"          "the"         "supermanket"

$text2
[1] "Marta"   "needs"   "to"      "go"      "Jamaica"
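
One thing to watch in the output above: setdiff() treats each document as a set, so the second "to" in text2 has disappeared along with the unwanted tokens. If token counts or word order matter later (for a term-document matrix, say), filtering with %in% may be a safer choice. The following is a minimal base R sketch of that idea, not a drop-in replacement: strsplit() stands in for tokens() and assumes the text is already free of punctuation, and drop_tokens is a hypothetical helper name.

```r
mytext <- c(text1 = "Carles werwa went to sadaf buy trsfr in the supermanket",
            text2 = "Marta needs to werwa sadaf go to Jamaica")
tokens_to_remove <- c("werwa", "sadaf", "trsfr")

# Simple whitespace tokenizer (assumption: punctuation was removed earlier)
tokenized <- strsplit(mytext, "\\s+")

# Hypothetical helper: drop unwanted tokens but keep order and repeated tokens
drop_tokens <- function(x, unwanted) x[!(x %in% unwanted)]

cleaned <- lapply(tokenized, drop_tokens, unwanted = tokens_to_remove)
cleaned$text2
```

Here cleaned$text2 still contains both occurrences of "to". Replacing !(x %in% unwanted) with x %in% dictionary would give the keep-only-dictionary-words variant.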

tokens_to_remove could instead be an English dictionary; in that case, use intersect() rather than setdiff() so that only the dictionary words are kept.
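
As for the error in the question itself: the warning shows that removeWords pastes every word into a single alternation of the form (*UCP)\b(w1|w2|...)\b, and with thousands of tokens PCRE refuses to compile it. If you would rather stay with tm, one possible workaround is to remove the words in batches so that each pattern stays small. The sketch below reproduces the same pattern in base R; remove_words_chunked and its chunk_size default are hypothetical, and it assumes the tokens contain no regex metacharacters (which should hold after removePunctuation).

```r
# Hypothetical helper: remove `words` from `texts` in batches of `chunk_size`
remove_words_chunked <- function(texts, words, chunk_size = 1000) {
  words <- unique(words[nzchar(words)])
  # Split the word list into consecutive batches
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    # Same pattern shape as in the warning message, but one batch at a time
    pattern <- sprintf("(*UCP)\\b(%s)\\b",
                       paste(sort(chunk, decreasing = TRUE), collapse = "|"))
    texts <- gsub(pattern, "", texts, perl = TRUE)
  }
  texts
}

texts <- c("c mailto sent march ne rntbci", "new statement nmuk await")
unwanted <- c("c", "mailto", "ne", "rntbci", "nmuk")

out <- remove_words_chunked(texts, unwanted, chunk_size = 2)
# Collapse the gaps left behind by the removed words
clean <- trimws(gsub("\\s+", " ", out))
clean
```

With a tm corpus this should slot into the original call as tm_map(corpus, content_transformer(remove_words_chunked), tokens_to_remove), though I have not run it on the asker's data. Note also that the tokens_to_remove sample in the question contains stems such as "advis", "appli" and "busi": because stemDocument runs before the comparison with GradyAugmented, stems of real words get flagged as non-dictionary. Comparing against the dictionary before stemming would likely avoid that.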
