
How will the document number affect the result of Gensim LDA?

I use three txt files for an LDA project, and I tried two ways of splitting them into documents. The difference between the two approaches is:

import re
import gensim
from gensim.corpora import Dictionary

# Each document must be a flat list of tokens for doc2bow
docs = [doc1.split(' '), doc2.split(' '), doc3.split(' ')]
# One document per paragraph: split each file on blank lines, then tokenize
docs1 = ([p.split(' ') for p in re.split(r'\n{1,}', doc11)]
         + [p.split(' ') for p in re.split(r'\n{1,}', doc22)]
         + [p.split(' ') for p in re.split(r'\n{1,}', doc33)])
dictionary = Dictionary(docs)
dictionary1 = Dictionary(docs1)
corpus = [dictionary.doc2bow(doc) for doc in docs]
corpus1 = [dictionary1.doc2bow(doc) for doc in docs1]  # fixed: was dictionary.doc2bow

And the document counts are:

>>> len(corpus)
3
>>> len(corpus1)
1329

But the LDA model produces a rubbish result on corpus and a relatively good result on corpus1.

I use this model to train on the documents:

model = gensim.models.ldamodel.LdaModel(corpus=corpus,       # or corpus1
                                        id2word=dictionary,  # or dictionary1
                                        num_topics=10,
                                        random_state=100,
                                        update_every=1,
                                        chunksize=100,
                                        passes=10,
                                        alpha='auto',
                                        per_word_topics=True)
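For illustration (this snippet is not in the original post), the learned topics can be printed to compare the two runs by eye:

# Print the top words of each learned topic to eyeball quality
for topic_id, topic in model.print_topics(num_topics=10, num_words=8):
    print(topic_id, topic)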

The only difference between the two models is the number of documents; everything else is the same.

Why does LDA produce such different results for these two models?

If you study LDA, I think almost everywhere the first line is "LDA is good for a large corpus, whereas it doesn't work well for short text". Your corpus contains only 3 documents, whereas corpus1 contains 1329, so it will definitely produce more accurate results for corpus1.
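One way to make "different result" concrete is to compare topic coherence. Here is a minimal sketch using gensim's CoherenceModel, assuming a second model, model1, trained on corpus1 the same way (model1 is not in the original post); the other names follow the question:

from gensim.models import CoherenceModel

for lda, texts, d in ((model, docs, dictionary), (model1, docs1, dictionary1)):
    # c_v coherence: higher usually means more interpretable topics
    cm = CoherenceModel(model=lda, texts=texts, dictionary=d, coherence='c_v')
    print(cm.get_coherence())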

Another point is that LDA works iteratively and draws random samples from the documents during training. With a large corpus (more documents), each sample is likely to be different, whereas a tiny corpus keeps yielding the same samples, and more varied samples lead to more accurate results.
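You can see this instability directly (an illustrative sketch, not from the original post): retrain on the 3-document corpus with different random seeds and watch the topics change, whereas on corpus1 the topics stay far more stable.

for seed in (1, 2, 3):
    # Same data and settings, only the seed changes
    m = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                        num_topics=10, random_state=seed, passes=10)
    print(seed, m.print_topics(num_topics=3, num_words=5))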

Hope this makes sense.
