
In R library(tm), how do I get the n-gram output with an underscore?

Below is my code, which creates bigrams from text data. The output is fine, except that I need the field names to contain an underscore instead of a space so that I can use them as variable names in a model.

text <- c("Since I love to travel, this is what I rely on every time.",
          "I got the rewards card for the no international transaction fee",
          "I got the rewards card mainly for the flight perks",
          "Very good card, easy application process, and no international transaction fee",
          "The customer service is outstanding!",
          "My wife got the rewards card for the gift cards and international transaction fee.She loves it")
df <- data.frame(text)


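library(tm)
# In tm >= 0.7, DataframeSource() expects doc_id and text columns, and a custom
# tokenizer needs a VCorpus, so the corpus is built from the character vector.
corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)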


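# Split each document into words and join every pair of adjacent words with a space
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# Drop the sparsest bigrams (sparsity threshold 0.80), then convert to a plain matrix
sparse <- removeSparseTerms(dtm, 0.80)
dtm2 <- as.matrix(sparse)
dtm2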

Here is what the output looks like:

    Terms
Docs got rewards international transaction rewards card transaction fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0

How do I make the field names look like got_rewards instead of got rewards?

This is not really a tm-specific question, I guess. Anyway, you can set collapse="_" in your tokenizer, or modify the column names after the fact, like so:

colnames(dtm2) <- gsub(" ", "_", colnames(dtm2), fixed = TRUE)
dtm2
    Terms
Docs got_rewards international_transaction rewards_card transaction_fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
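
The other option mentioned above, setting collapse="_" in the tokenizer, could look roughly like the sketch below. The two sample sentences in docs are made up for illustration and are not the data from the question. VCorpus is used on the assumption that a recent tm version is installed, where a custom tokenizer is not applied to a SimpleCorpus.

```r
library(tm)  # tm depends on NLP, which provides ngrams() and words()

# Same tokenizer as in the question, but each bigram is joined with "_"
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)
}

# Hypothetical sample documents, for illustration only
docs <- c("i got the rewards card", "the rewards card is good")
corpus <- VCorpus(VectorSource(docs))

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))

# The term names now contain underscores instead of spaces
Terms(dtm)
```

With this approach the terms already carry the underscore when the matrix is built, so no renaming step is needed afterwards.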
