
Combining LSA/LSI with Naive Bayes for document classification

I'm new to the gensim package and vector space models in general, and I'm unsure of what exactly I should do with my LSA output.

To give a brief overview of my goal, I'd like to enhance a Naive Bayes classifier using topic modeling to improve classification of reviews (positive or negative). Here's a great paper I've been reading that has shaped my ideas but left me still somewhat confused about implementation.

I've already got working code for Naive Bayes--currently, I'm just using a unigram bag of words as my features, and my labels are either positive or negative.
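
For reference, here is a minimal sketch of the kind of unigram bag-of-words Naive Bayes baseline described above, assuming scikit-learn's CountVectorizer and MultinomialNB (the question's actual classifier code is not shown, and the tiny training set is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy stand-ins for the real positive/negative training reviews
train_reviews = ["great movie, loved it", "terrible plot, boring acting"]
train_labels = ["positive", "negative"]

vectorizer = CountVectorizer()                      # unigram term counts
X_train = vectorizer.fit_transform(train_reviews)   # documents -> sparse count matrix
clf = MultinomialNB().fit(X_train, train_labels)    # train the Naive Bayes classifier

print(clf.predict(vectorizer.transform(["boring acting"])))  # -> ['negative'] on this toy data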

Here's my gensim code:

from pprint import pprint # pretty printer
import gensim as gs

# tutorial sample documents
docs = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]


# stoplist removal, tokenization
stoplist = set('for a of the and to in'.split())
# for each document: lowercase document, split by whitespace, and add all its words not in stoplist to texts
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in docs]


# create dictionary (mapping between words and integer ids)
dictionary = gs.corpora.Dictionary(texts)
# create bag-of-words corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# tf-idf
tfidf = gs.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# latent semantic indexing with 10 topics
lsi = gs.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)

for topic in lsi.print_topics():
    print(topic)

Here's the output:

0.400*"system" + 0.318*"survey" + 0.290*"user" + 0.274*"eps" + 0.236*"management" + 0.236*"opinion" + 0.235*"response" + 0.235*"time" + 0.224*"interface" + 0.224*"computer"
0.421*"minors" + 0.420*"graph" + 0.293*"survey" + 0.239*"trees" + 0.226*"paths" + 0.226*"intersection" + -0.204*"system" + -0.196*"eps" + 0.189*"widths" + 0.189*"quasi"
-0.318*"time" + -0.318*"response" + -0.261*"error" + -0.261*"measurement" + -0.261*"perceived" + -0.261*"relation" + 0.248*"eps" + -0.203*"opinion" + 0.195*"human" + 0.190*"testing"
0.416*"random" + 0.416*"binary" + 0.416*"generation" + 0.416*"unordered" + 0.256*"trees" + -0.225*"minors" + -0.177*"survey" + 0.161*"paths" + 0.161*"intersection" + 0.119*"error"
-0.398*"abc" + -0.398*"lab" + -0.398*"machine" + -0.398*"applications" + -0.301*"computer" + 0.242*"system" + 0.237*"eps" + 0.180*"testing" + 0.180*"engineering" + 0.166*"management"

Any suggestions or general comments would be appreciated.

I just started working on the same problem, but with an SVM instead. AFAIK, after training your model you need to do something like this:

new_text = 'here is some document'
# tokenize the new document the same way as the training texts, then map it to bag-of-words
text_bow = dictionary.doc2bow(new_text.lower().split())
# apply the same tf-idf transform the LSI model was trained on, then project into topic space
vector = lsi[tfidf[text_bow]]

Here vector is the topic distribution of your document, with length equal to the number of topics you chose for training (10 in your case). So you need to represent all your documents as topic distributions and then feed them to the classification algorithm.
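
A minimal sketch of that last step, reusing the lsi model and corpus_tfidf from the question and assuming scikit-learn for the classifier (the 0/1 labels below are made up for illustration, since the toy tutorial corpus has no real positive/negative reviews):

import numpy as np
from gensim import matutils
from sklearn.naive_bayes import GaussianNB

# project every training document into the 10-dimensional LSI topic space
corpus_lsi = lsi[corpus_tfidf]
X = matutils.corpus2dense(corpus_lsi, num_terms=10).T   # shape: (num_docs, num_topics)

# hypothetical labels, one per document; substitute your real positive/negative labels
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

# LSI features are real-valued and can be negative, so Gaussian Naive Bayes (or the
# answer's SVM) is a more natural fit than multinomial NB over raw counts
clf = GaussianNB().fit(X, y)
print(clf.predict(X[:2]))   # predicted labels for the first two documents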

PS: I know it's kind of an old question, but I keep seeing it in Google results every time I search for it.
