简体   繁体   English

主题建模输出错误

[英]Having a wrong Output in topic modelling

I have tried topic modelling in python . 我已经尝试在python中进行主题建模。 But its displaying wrong output. 但是它显示错误的输出。 I have provided sample example and codes below. 我在下面提供了示例示例和代码。

## Documents

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle." 

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
stop_free = "".join([i for i in doc.lower() if i not in stop])
punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
return normalized

doc_clean = [clean(doc) for doc in doc_complete] 

#Preparing Document Term Matrix
import gensim 

dictionary = corpora.Dictionary([doc_clean])
corpus = [dictionary.doc2bow(doc) for doc in [doc_clean]]

#Running LDA Model

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)


print(ldamodel.print_topics(num_topics=3, num_words=3))

I am getting an output like following : 我得到如下输出:

[(0, u'0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*fher pen l f e rvng er run nce prcce + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (1, u'0.200*helh exper h ugr n g fr ur lfele + 0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (2, u'0.200*fher pen l f e rvng er run nce prcce + 0.200*ugr b cnue er lke hve ugr bu n fher + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer')]

I am wondering , what I have missed. 我想知道,我错过了什么。 Thanks 谢谢

The problem happens because of you didn't tokenize your documents before removing the stop words. 发生此问题的原因是,在删除停用词之前,您没有标记文档。 Instead you iterate through each character and remove the characters that are stopwords, eg "a", "i": 相反,您遍历每个字符并删除作为停用词的字符,例如“ a”,“ i”:

>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> stop
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> "".join([i for i in doc.lower() if i not in stop])
'ugr  b  cnue.  er lke  hve ugr, bu n  fher.'

You should have processed the stopword removals like this: 您应该已经像这样处理了停用词删除操作:

>>> from nltk import word_tokenize
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
'sugar bad consume . sister likes sugar , father .'

See Stopword removal with NLTK 请参阅使用NLTK删除停用词


Actually, you pre-processsing pipeline can be simplified. 实际上,您可以简化管道的预处理过程。

>>> import gensim
>>> doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> doc2 = "My father spends a lot of time driving my sister around to dance practice."
>>> doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
>>> doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
>>> doc5 = "Health experts say that Sugar is not good for your lifestyle." 
>>> documents = [doc1, doc2, doc3, doc4, doc5]
>>> texts = map(gensim.utils.lemmatize,documents)
>>> texts
[['sugar/NN', 'be/VB', 'bad/JJ', 'consume/VB', 'sister/NN', 'like/VB', 'have/VB', 'sugar/NN', 'not/RB', 'father/NN'], ['father/NN', 'spend/VB', 'lot/NN', 'time/NN', 'drive/VB', 'sister/NN', 'dance/VB', 'practice/NN'], ['doctor/NN', 'suggest/VB', 'drive/VB', 'cause/VB', 'increased/JJ', 'stress/NN', 'blood/NN', 'pressure/NN'], ['sometimes/RB', 'feel/JJ', 'pressure/NN', 'perform/VB', 'well/RB', 'school/NN', 'father/NN', 'never/RB', 'seem/VB', 'drive/VB', 'sister/NN', 'do/VB', 'better/JJ'], ['health/NN', 'expert/NN', 'say/VB', 'sugar/NN', 'be/VB', 'not/RB', 'good/JJ', 'lifestyle/NN']]

Then to train the topic model: 然后训练主题模型:

>>> dictionary = gensim.corpora.Dictionary(texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in texts]
>>> Lda = gensim.models.ldamodel.LdaModel
>>> ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)
>>> ldamodel.print_topics()
[(0, u'0.067*drive/VB + 0.067*pressure/NN + 0.067*stress/NN + 0.067*blood/NN + 0.067*doctor/NN + 0.067*increased/JJ + 0.067*cause/VB + 0.067*suggest/VB + 0.017*sister/NN + 0.017*father/NN'), (1, u'0.078*sugar/NN + 0.054*not/RB + 0.054*be/VB + 0.054*father/NN + 0.054*sister/NN + 0.031*do/VB + 0.031*seem/VB + 0.031*school/NN + 0.031*well/RB + 0.031*better/JJ'), (2, u'0.067*drive/VB + 0.067*sister/NN + 0.067*father/NN + 0.067*lot/NN + 0.067*practice/NN + 0.067*dance/VB + 0.067*spend/VB + 0.067*time/NN + 0.017*pressure/NN + 0.017*expert/NN')]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM