
Getting the wrong output in topic modelling

I have tried topic modelling in Python, but it is displaying the wrong output. I have provided a sample example and the code below.

## Documents

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle." 

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = "".join([i for i in doc.lower() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc) for doc in doc_complete] 

#Preparing Document Term Matrix
import gensim
from gensim import corpora

dictionary = corpora.Dictionary([doc_clean])
corpus = [dictionary.doc2bow(doc) for doc in [doc_clean]]

#Running LDA Model

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)


print(ldamodel.print_topics(num_topics=3, num_words=3))

I am getting output like the following:

[(0, u'0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*fher pen l f e rvng er run nce prcce + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (1, u'0.200*helh exper h ugr n g fr ur lfele + 0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (2, u'0.200*fher pen l f e rvng er run nce prcce + 0.200*ugr b cnue er lke hve ugr bu n fher + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer')]

I am wondering what I have missed. Thanks.

The problem happens because you didn't tokenize your documents before removing the stop words. Instead, you iterate through each character and remove the characters that are stopwords, e.g. "a", "i":

>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> stop
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> "".join([i for i in doc.lower() if i not in stop])
'ugr  b  cnue.  er lke  hve ugr, bu n  fher.'

You should have done the stopword removal like this:

>>> from nltk import word_tokenize
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
'sugar bad consume . sister likes sugar , father .'

See Stopword removal with NLTK
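
Putting that together, here is a minimal sketch (my adaptation, not part of the original question) of how the clean() function and the dictionary/corpus construction could be rewritten. It keeps your lemmatizer and stopword list, but tokenizes first and returns a list of tokens per document, so each document becomes a list of words rather than one long string:

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # tokenize first, then drop stopword and punctuation tokens
    tokens = [t for t in word_tokenize(doc.lower()) if t not in stop and t not in exclude]
    # lemmatize each remaining token; return a list of tokens, not a joined string
    return [lemma.lemmatize(t) for t in tokens]

doc_clean = [clean(doc) for doc in doc_complete]

# pass the list of token lists directly, without the extra [] wrapping
dictionary = corpora.Dictionary(doc_clean)
corpus = [dictionary.doc2bow(doc) for doc in doc_clean]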


Actually, your pre-processing pipeline can be simplified.

>>> import gensim
>>> doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> doc2 = "My father spends a lot of time driving my sister around to dance practice."
>>> doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
>>> doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
>>> doc5 = "Health experts say that Sugar is not good for your lifestyle." 
>>> documents = [doc1, doc2, doc3, doc4, doc5]
>>> texts = list(map(gensim.utils.lemmatize, documents))
>>> texts
[['sugar/NN', 'be/VB', 'bad/JJ', 'consume/VB', 'sister/NN', 'like/VB', 'have/VB', 'sugar/NN', 'not/RB', 'father/NN'], ['father/NN', 'spend/VB', 'lot/NN', 'time/NN', 'drive/VB', 'sister/NN', 'dance/VB', 'practice/NN'], ['doctor/NN', 'suggest/VB', 'drive/VB', 'cause/VB', 'increased/JJ', 'stress/NN', 'blood/NN', 'pressure/NN'], ['sometimes/RB', 'feel/JJ', 'pressure/NN', 'perform/VB', 'well/RB', 'school/NN', 'father/NN', 'never/RB', 'seem/VB', 'drive/VB', 'sister/NN', 'do/VB', 'better/JJ'], ['health/NN', 'expert/NN', 'say/VB', 'sugar/NN', 'be/VB', 'not/RB', 'good/JJ', 'lifestyle/NN']]
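
(Side note, not part of the original answer: gensim.utils.lemmatize relies on the optional pattern package and was removed in gensim 4.x. If it is not available in your environment, one rough substitute is gensim.utils.simple_preprocess, which only tokenizes and lowercases, so the topics would then contain plain words instead of the word/POS tokens shown above.)

>>> # fallback if gensim.utils.lemmatize is unavailable: tokenize/lowercase only
>>> texts = [gensim.utils.simple_preprocess(doc) for doc in documents]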

Then to train the topic model:

>>> dictionary = gensim.corpora.Dictionary(texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in texts]
>>> Lda = gensim.models.ldamodel.LdaModel
>>> ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)
>>> ldamodel.print_topics()
[(0, u'0.067*drive/VB + 0.067*pressure/NN + 0.067*stress/NN + 0.067*blood/NN + 0.067*doctor/NN + 0.067*increased/JJ + 0.067*cause/VB + 0.067*suggest/VB + 0.017*sister/NN + 0.017*father/NN'), (1, u'0.078*sugar/NN + 0.054*not/RB + 0.054*be/VB + 0.054*father/NN + 0.054*sister/NN + 0.031*do/VB + 0.031*seem/VB + 0.031*school/NN + 0.031*well/RB + 0.031*better/JJ'), (2, u'0.067*drive/VB + 0.067*sister/NN + 0.067*father/NN + 0.067*lot/NN + 0.067*practice/NN + 0.067*dance/VB + 0.067*spend/VB + 0.067*time/NN + 0.017*pressure/NN + 0.017*expert/NN')]
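
If you also want to see which topic each document falls under, you can apply the trained model back to the corpus (standard gensim usage, not shown in the original question):

>>> for i, bow in enumerate(corpus):
...     print(i, ldamodel.get_document_topics(bow))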
