How to implement Latent Dirichlet Allocation to give bigrams/trigrams in topics instead of unigrams
Capture bigram topics instead of unigrams using latent dirichlet allocat
I tried to follow the approach from this question.
Original LDA output
Uni-grams
topic1 - scuba, water, vapor, diving
topic2 - dioxide, plants, green, carbon
Desired output
Bi-gram topics
topic1 - scuba diving, water vapor
topic2 - green plants, carbon dioxide
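I also found this answer:
    from nltk.util import ngrams
    # docs is assumed to map each document id to its list of tokens
    for doc in docs:
        # keep the unigrams and append the "_"-joined bigrams
        docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]
What should I change so that the documents contain only bigrams?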
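Build each document from bigrams only, dropping the unigram part of the concatenation:
    from nltk.util import ngrams
    for doc in docs:
        # replace the token list with "_"-joined bigrams only
        docs[doc] = ["_".join(w) for w in ngrams(docs[doc], 2)]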
Or use the dedicated bigram helper:
    from nltk.util import bigrams
    for doc in docs:
        # bigrams(seq) yields the same pairs as ngrams(seq, 2)
        docs[doc] = ["_".join(w) for w in bigrams(docs[doc])]
Then use these lists of bigrams as texts in the subsequent steps.
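To see what the bigram-only documents look like before they go into a topic model, here is a minimal sketch in plain Python. It does not use NLTK: the helper `to_ngram_tokens` and the toy `docs` dictionary are hypothetical names made up for illustration, and the helper is only assumed to mirror what joining the output of `ngrams(tokens, n)` with "_" produces. Setting `n=3` would give trigram tokens in the same way.

```python
from collections import Counter


def to_ngram_tokens(tokens, n=2):
    """Join each run of n consecutive tokens with "_" (n=2 gives bigrams)."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy corpus: document id -> list of unigram tokens.
docs = {
    "d1": ["scuba", "diving", "water", "vapor"],
    "d2": ["green", "plants", "carbon", "dioxide"],
}

# Replace each document's unigrams with bigram tokens only.
bigram_docs = {doc_id: to_ngram_tokens(tokens, 2) for doc_id, tokens in docs.items()}

# Per-document token counts, i.e. a simple bag-of-words over the bigram tokens.
bow = {doc_id: Counter(tokens) for doc_id, tokens in bigram_docs.items()}

print(bigram_docs["d1"])  # ['scuba_diving', 'diving_water', 'water_vapor']
```

Note that every pair of adjacent tokens becomes a bigram, so pairs that span two phrases, such as `diving_water`, end up in the vocabulary alongside `scuba_diving` and `water_vapor`.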