DocumentTermMatrix in R - sum of unique words for each row

I have a DocumentTermMatrix data_tags with 80,000 rows (groups of tags) and 900,000 columns, i.e. 900,000 different tags. Using findFreqTerms(data_tags, 2) I found that about 462,000 of these tags are unique (they occur only once).

I want to write a function that does two things:
- delete these 462,000 columns, so that only tags with frequency 2 or more are left;
- create one new column (Uniques) holding, for each row, the sum() of all the unique tags that were removed.

     tag1 tag2 tag3 tag4
1       0    0    1    0
2       0    1    0    0
2       1    0    0    0
3       1    0    0    0
4       0    1    0    1
5       1    0    0    0
6       0    1    0    0

For example, tag3 and tag4 are unique (each appears only once in its column), so the result should be:

     tag1 tag2 Uniques
1       0    0       1   
2       0    1       0    
2       1    0       0    
3       1    0       0    
4       0    1       1    
5       1    0       0    
6       0    1       0    
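
For reference, a minimal base-R sketch that reproduces this toy example. It uses a dense matrix purely for illustration (at 80,000 x 900,000 a dense matrix is not an option), and the names m, keep, uniques and result are made up here:

```r
# Toy data from the example above (row labels omitted; they are not needed here).
m <- matrix(
  c(0, 0, 1, 0,
    0, 1, 0, 0,
    1, 0, 0, 0,
    1, 0, 0, 0,
    0, 1, 0, 1,
    1, 0, 0, 0,
    0, 1, 0, 0),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, paste0("tag", 1:4))
)

keep <- colSums(m) >= 2                       # tags that occur at least twice
uniques <- rowSums(m[, !keep, drop = FALSE])  # per-row sum over the dropped tags
result <- cbind(m[, keep, drop = FALSE], Uniques = uniques)
result
```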

Thanks in advance for the help.

Maybe the following works for you.

library(slam)
library(tm)

set.seed(0)
terms <- sapply(LETTERS, function(letter) paste(rep.int(letter, 5), collapse = ""))
ndocs <- 5
doc_lengths <- sample(5:10, ndocs, TRUE)
docs <- lapply(doc_lengths, function(doc_len) sample(terms, doc_len, TRUE))

dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
as.matrix(dtm)

## delete columns so that only terms with frequency >= 2 are left
## the function col_sums from the slam package helps here
b <- col_sums(dtm) >= 2
dtm_deleted <- dtm[,!b]
dtm <- dtm[,b]
as.matrix(dtm)

## Uniques column: per-row count of the removed (unique) terms
as.matrix(dtm_deleted)
row_sums(dtm_deleted > 0)
dtm_new <- cbind(dtm, Uniques = row_sums(dtm_deleted > 0))
colnames(dtm_new)[ncol(dtm_new)] <- "Uniques"
as.matrix(dtm_new)
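
Since the question asks for a function, the steps above could be wrapped roughly as follows. This is only a sketch: the name collapse_rare_terms and the min_freq argument are made up here, and it assumes x is a DocumentTermMatrix or any other slam simple_triplet_matrix.

```r
library(slam)

# Hypothetical helper (name and arguments are illustrative, not part of tm or slam):
# drop all columns whose total count is below min_freq and append one "Uniques"
# column holding, for each row, the summed counts of the dropped columns.
collapse_rare_terms <- function(x, min_freq = 2) {
  keep <- col_sums(x) >= min_freq      # TRUE for terms that stay
  uniques <- row_sums(x[, !keep])      # per-row sum over the removed terms
  out <- cbind(x[, keep], uniques)     # stays sparse via slam's cbind method
  colnames(out)[ncol(out)] <- "Uniques"
  out
}

# Usage on the toy matrix from the question
m <- matrix(c(0, 0, 1, 0,  0, 1, 0, 0,  1, 0, 0, 0,  1, 0, 0, 0,
              0, 1, 0, 1,  1, 0, 0, 0,  0, 1, 0, 0),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, paste0("tag", 1:4)))
out <- collapse_rare_terms(as.simple_triplet_matrix(m))
as.matrix(out)
```

For terms that occur exactly once, summing the counts gives the same result as the row_sums(dtm_deleted > 0) used above. Note that the combined object may no longer carry the DocumentTermMatrix class, so check class(out) before passing it to tm functions.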

