
Getting the wrong output in topic modelling

I have tried topic modelling in Python, but it is displaying the wrong output. I have provided a sample example and the code below.

## Documents

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle." 

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = "".join([i for i in doc.lower() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc) for doc in doc_complete] 

#Preparing Document Term Matrix
import gensim
from gensim import corpora

dictionary = corpora.Dictionary([doc_clean])
corpus = [dictionary.doc2bow(doc) for doc in [doc_clean]]

#Running LDA Model

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)


print(ldamodel.print_topics(num_topics=3, num_words=3))

I am getting output like the following:

[(0, u'0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*fher pen l f e rvng er run nce prcce + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (1, u'0.200*helh exper h ugr n g fr ur lfele + 0.200*cr ugge h rvng cue ncree re n bl preure + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer'), (2, u'0.200*fher pen l f e rvng er run nce prcce + 0.200*ugr b cnue er lke hve ugr bu n fher + 0.200*ee feel preure perfr well chl bu fher never ee rve er beer')]

I am wondering what I have missed. Thanks.

The problem happens because you didn't tokenize your documents before removing the stop words. Instead, you iterate through each character and remove the characters that are stopwords, e.g. "a", "i":

>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> stop
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> "".join([i for i in doc.lower() if i not in stop])
'ugr  b  cnue.  er lke  hve ugr, bu n  fher.'

You should have done the stopword removal like this:

>>> from nltk import word_tokenize
>>> doc = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
'sugar bad consume . sister likes sugar , father .'

See Stopword removal with NLTK
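
Putting that together, here is a minimal sketch (my adaptation, not part of the original question) of how the clean() function and the dictionary/corpus construction could be rewritten. It keeps your lemmatizer and stopword list, but tokenizes first and returns a list of tokens per document, so each document becomes a list of words rather than one long string:

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # tokenize first, then drop stopword and punctuation tokens
    tokens = [t for t in word_tokenize(doc.lower()) if t not in stop and t not in exclude]
    # lemmatize each remaining token; return a list of tokens, not a joined string
    return [lemma.lemmatize(t) for t in tokens]

doc_clean = [clean(doc) for doc in doc_complete]

# pass the list of token lists directly, without the extra [] wrapping
dictionary = corpora.Dictionary(doc_clean)
corpus = [dictionary.doc2bow(doc) for doc in doc_clean]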


Actually, your pre-processing pipeline can be simplified.

>>> import gensim
>>> doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
>>> doc2 = "My father spends a lot of time driving my sister around to dance practice."
>>> doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
>>> doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
>>> doc5 = "Health experts say that Sugar is not good for your lifestyle." 
>>> documents = [doc1, doc2, doc3, doc4, doc5]
>>> texts = list(map(gensim.utils.lemmatize, documents))
>>> texts
[['sugar/NN', 'be/VB', 'bad/JJ', 'consume/VB', 'sister/NN', 'like/VB', 'have/VB', 'sugar/NN', 'not/RB', 'father/NN'], ['father/NN', 'spend/VB', 'lot/NN', 'time/NN', 'drive/VB', 'sister/NN', 'dance/VB', 'practice/NN'], ['doctor/NN', 'suggest/VB', 'drive/VB', 'cause/VB', 'increased/JJ', 'stress/NN', 'blood/NN', 'pressure/NN'], ['sometimes/RB', 'feel/JJ', 'pressure/NN', 'perform/VB', 'well/RB', 'school/NN', 'father/NN', 'never/RB', 'seem/VB', 'drive/VB', 'sister/NN', 'do/VB', 'better/JJ'], ['health/NN', 'expert/NN', 'say/VB', 'sugar/NN', 'be/VB', 'not/RB', 'good/JJ', 'lifestyle/NN']]
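
(Side note, not part of the original answer: gensim.utils.lemmatize relies on the optional pattern package and was removed in gensim 4.x. If it is not available in your environment, one rough substitute is gensim.utils.simple_preprocess, which only tokenizes and lowercases, so the topics would then contain plain words instead of the word/POS tokens shown above.)

>>> # fallback if gensim.utils.lemmatize is unavailable: tokenize/lowercase only
>>> texts = [gensim.utils.simple_preprocess(doc) for doc in documents]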

Then to train the topic model:

>>> dictionary = gensim.corpora.Dictionary(texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in texts]
>>> Lda = gensim.models.ldamodel.LdaModel
>>> ldamodel = Lda(corpus, num_topics=3, id2word = dictionary, passes=50)
>>> ldamodel.print_topics()
[(0, u'0.067*drive/VB + 0.067*pressure/NN + 0.067*stress/NN + 0.067*blood/NN + 0.067*doctor/NN + 0.067*increased/JJ + 0.067*cause/VB + 0.067*suggest/VB + 0.017*sister/NN + 0.017*father/NN'), (1, u'0.078*sugar/NN + 0.054*not/RB + 0.054*be/VB + 0.054*father/NN + 0.054*sister/NN + 0.031*do/VB + 0.031*seem/VB + 0.031*school/NN + 0.031*well/RB + 0.031*better/JJ'), (2, u'0.067*drive/VB + 0.067*sister/NN + 0.067*father/NN + 0.067*lot/NN + 0.067*practice/NN + 0.067*dance/VB + 0.067*spend/VB + 0.067*time/NN + 0.017*pressure/NN + 0.017*expert/NN')]
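
If you also want to see which topic each document falls under, you can apply the trained model back to the corpus (standard gensim usage, not shown in the original question):

>>> for i, bow in enumerate(corpus):
...     print(i, ldamodel.get_document_topics(bow))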
