
R topic modeling - lda command 'lexicalize' giving unexpected results

I am using the 'lda' package in R to perform a topic model analysis of a corpus (let's call it 'corpusB'). I am preparing the corpus for the analysis by first using the command 'lexicalize', which returns a term-document matrix and, if not pre-specified, a vocabulary with unique tokens appearing in the corpus.

For research purposes, I want to lexicalize the corpus using a vocabulary inferred from another corpus (let's call it 'corpusA'), something that should be easily done. Yet, it is not working. Here is a sample of the code:

A <- lexicalize(corpusA) #the output of this command is just as expected
B <- lexicalize(corpusB, vocab = corpusA$vocab)

B$documents #let's see the term-document matrix
>>NULL #this is what I get

Any idea of why I am getting a null result? Strangely enough, the command works just fine if I am using simple character vectors rather than imported corpora.

A <- c("I have the very model of a modern major general")
B <- c("I have a major headache")

B1 <- lexicalize(B)
B1

$documents
$documents[[1]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    2    3    4
[2,]    1    1    1    1    1


$vocab
[1] "i"        "have"     "a"        "major"    "headache"


A1 <- lexicalize(A, vocab = B1$vocab)
A1
[[1]]
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    1    1    1    1

A few more pieces of information that might be useful:

1) The corpus I am interested in (corpusB) contains 700 MB of text, which is a considerable amount of data;

2) Both corpora (B and A) are imported into R using the 'tm' package. Before the lexicalization, I also use 'tm' to remove punctuation, numbers and stopwords, to strip whitespace, and to convert the text to lower case.

Any help is very much appreciated!

lexicalize() expects a character vector of document lines, from which it constructs a corpus and vocabulary suitable for lda. A tm corpus should therefore be converted to a character vector before calling lexicalize():

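# pull the text content out of each tm document into a plain character column
texts <- data.frame(text = unlist(sapply(corpusA, `[`, "content")), stringsAsFactors = FALSE)
# lexicalize the character vector rather than the tm corpus object
l_corp <- lexicalize(texts$text)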
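
To tie this back to the two-corpus case in the question, here is a minimal sketch of the whole workflow. It is not a definitive implementation. The toy corpora reuse the two sentences from the question, and corpus_to_text is a hypothetical helper name introduced only for illustration. It assumes each document is a plain-text tm document, so that as.character() returns its lines. Note also that, as the A1 output in the question shows, lexicalize() returns the list of document matrices directly when vocab is supplied, so there is no $documents element to look up in that case.

```r
library(tm)
library(lda)

# Toy stand-ins for the two imported corpora (sentences taken from the question).
corpusA <- VCorpus(VectorSource("I have the very model of a modern major general"))
corpusB <- VCorpus(VectorSource("I have a major headache"))

# Hypothetical helper: collapse each tm document into one string,
# giving the plain character vector that lexicalize() expects.
corpus_to_text <- function(corpus) {
  vapply(corpus,
         function(doc) paste(as.character(doc), collapse = " "),
         character(1), USE.NAMES = FALSE)
}

textA <- corpus_to_text(corpusA)
textB <- corpus_to_text(corpusB)

# Without vocab: the result is a list with $documents and $vocab.
A <- lexicalize(textA)

# With vocab: the list of document matrices is returned directly,
# so index it with [[i]] rather than $documents.
B_docs <- lexicalize(textB, vocab = A$vocab)

# Row 1 holds 0-based indices into A$vocab, row 2 holds the counts.
B_docs[[1]]

# Map the indices back to words to check which tokens were matched.
A$vocab[B_docs[[1]][1, ] + 1]
```

Words in corpusB that are absent from the vocabulary of corpusA (here "headache") are simply dropped, which matches the behaviour seen in the question's character-vector example.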
