
How to remove meaningless words from corpus?

I am new to R and am trying to remove meaningless words from a corpus. I have a dataframe with emails in one column and the target variable in another, and I'm trying to clean the email body data. I have used the tm and qdap packages for this. I have already gone through most of the other questions and tried the example in "Remove meaningless words from corpus in R". The problem is that when I try to remove unwanted tokens (which are not dictionary words) from the corpus, I get an error.

library(qdap)
library(tm)

corpus = Corpus(VectorSource(Email$Body))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)

corpus = tm_map(corpus, stemDocument)

tdm = TermDocumentMatrix(corpus)
all_tokens = findFreqTerms(tdm,1)
tokens_to_remove = setdiff(all_tokens, GradyAugmented)
corpus <- tm_map(corpus, content_transformer(removeWords), tokens_to_remove)

Running the above line of code gives me the error below.

  invalid regular expression '(*UCP)\b(zyx|zyer|zxxxxxâ|zxxxxx|zwischenzeit|zwei|zvolen|zverejneni|zurã|zum|zstepswc|zquez|zprã|zorunlulu|zona|zoho|znis|zmir|zlf|zink|zierk|zhou|zhodnoteni|zgyã|zgã|zfs|zfbeswstat|zerust|zeroâ|zeppelinstr|zellerstrass|zeldir|zel|zdanska|zcfqc|zaventem|zarecka|zarardan|zaragoza|zaobchã|zamã|zakã|zaira|zahradnikova|zagorska|zagã|zachyti|zabih|zã|yusof|yukinobu|yui|ypg|ypaint|youtub|yoursid|youâ|yoshitada|yorkshir|yollayan|yokohama|yoganandam|yiewsley|yhlhjpz|yer|yeovil|yeni|yeatman|yazarina|yazaki|yaz|yasakt|yarm|yara|yannick|yanlislikla|yakar|yaiza|yabortslitem|yã|xxxxx|xxxxgbl|xuezi|xuefeng|xprn|xma|xlsx|xjchvnbbafeg|xiii|xii|xiaonan|xgb|xcede|wythenshaw|wys|wydzial|wydzia|wycomb|www|wuppert|wroclaw|wroc|wrightâ|wpisana|woustvil|wouldnâ|worthwhil|worsley|worri|worldwid|worldâ|workwear|worcestershir|worc|wootton|wooller|woodtec|woodsid|woodmansey|woodley|woodham|woodgat|wonâ|wolverhampton|wjodoyg|wjgfjiq|witti|witt|witkowski|wiss
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'regular expression is too large'
    at ''

Sample email from the corpus:

[794] "c mailto sent march ne rntbci accountspay nmuk subject new sig plc item still new statement await retriev use link connect account connect account link work copi past follow text address bar top internet browser https od datainterconnect com sigd sigdodsaccount php p thgqmdz d dt s contact credit control contact experi technic problem visit http bau faq datainterconnect com sig make payment call autom credit debit card payment line sig may abl help improv cashflow risk manag retent recoveri contract disput via www sigfinancetool co uk websit provid detail uniqu award win servic care select third parti avail sig custom power" 

tokens_to_remove[1:10]
 [1] "advis"        "appli"        "atlassian"    "bosch"        "boschrexroth" "busi"        
 [7] "communic"     "dcen"         "dcgbsom"      "email" 

I want to remove all words which are otherwise meaningless in English, i.e. c, mailto, ne, accountspay, nmuk, etc.
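For context, the error happens because removeWords pastes every word into a single regex alternation (`(*UCP)\b(word1|word2|…)\b`), which overflows PCRE's pattern-size limit once tokens_to_remove holds thousands of entries. A common workaround is to remove the words in batches so each pattern stays small. Below is a minimal base-R sketch of that idea (remove_words_batched and batch_size are made-up names, not part of tm):

```r
# Sketch: remove words in batches so each compiled regex stays small enough
# for PCRE. batch_size is an assumed tuning parameter; lower it if the
# "regular expression is too large" error persists.
remove_words_batched <- function(text, words, batch_size = 500) {
  batches <- split(words, ceiling(seq_along(words) / batch_size))
  for (batch in batches) {
    # Same pattern shape that tm's removeWords builds, but per batch
    pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(batch, collapse = "|"))
    text <- gsub(pattern, "", text, perl = TRUE)
  }
  gsub("\\s+", " ", trimws(text))  # collapse the whitespace left behind
}

remove_words_batched("c mailto sent march ne rntbci", c("c", "mailto", "ne"))
# "sent march rntbci"
```

Applied to the corpus above, the loop body would be wrapped in tm_map(corpus, content_transformer(...)) once per batch of tokens_to_remove.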

I would do it as follows:

library(quanteda)

mytext <- c("Carles werwa went to sadaf buy trsfr in the supermanket",
            "Marta needs to werwa sadaf go to Jamaica")   # my corpus
tokens_to_remove <- c("werwa", "sadaf", "trsfr")          # my "dictionary" of unwanted tokens
TokenizedText <- tokens(mytext,
                        remove_punct = TRUE,
                        remove_numbers = TRUE)            # tokenize; you could also use an English dictionary here
mytextClean <- lapply(TokenizedText, function(x) setdiff(x, tokens_to_remove))  # drop the unwanted tokens

mytextClean
$text1
[1] "Carles"      "went"        "to"          "buy"         "in"          "the"         "supermanket"

$text2
[1] "Marta"   "needs"   "to"      "go"      "Jamaica"

tokens_to_remove could also just be an English dictionary, in which case you would use intersect() instead of setdiff().
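To illustrate the intersect() variant: instead of deleting a blacklist, you keep only tokens found in a dictionary. A small self-contained sketch (dict here is a toy stand-in for a real word list such as qdap's GradyAugmented):

```r
# Sketch: keep only dictionary words rather than removing unwanted ones.
# dict is a toy stand-in for a real English word list (e.g. GradyAugmented).
dict <- c("carles", "went", "to", "buy", "in", "the", "needs", "go")
toks <- list(text1 = c("carles", "werwa", "went", "to", "sadaf", "buy"),
             text2 = c("marta", "needs", "to", "werwa", "go"))

cleaned <- lapply(toks, function(x) intersect(x, dict))
cleaned$text1
# "carles" "went" "to" "buy"
```

One caveat: setdiff() and intersect() both return unique values, so repeated tokens collapse to one occurrence, which distorts word counts. If counts matter, quanteda's tokens_select() (with selection = "keep" or "remove") preserves duplicates.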
