

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of documents that have already been human-classified into groups.

Is there a modified version of LDA that I can use to train a model and then classify unknown documents with it later?

For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, while classification is a discriminative problem. There is a variant of LDA called supervised LDA which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max-margin formulation whose source-code status I don't know. I would avoid the Labeled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.

However, it's worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they take documents and, rather than using word-based features, use the posterior over the topics (the vector that results from inference on the document) as the feature representation before feeding it to a classifier, usually a linear SVM. This gets you a topic-model-based dimensionality reduction followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available in most languages through popular toolkits.
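For example, here is a minimal sketch of that pipeline using scikit-learn; the corpus, labels, and parameter choices below are toy placeholders, not part of the original answer:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy placeholder data: human-labeled documents and their class labels
docs = ["exploitative and largely devoid of depth",
        "simplistic silly and tedious",
        "offers that rare combination of entertainment and education",
        "this is a film well worth seeing"]
labels = [0, 0, 1, 1]

# bag-of-words counts -> per-document topic posteriors -> linear SVM
clf = make_pipeline(
    CountVectorizer(stop_words='english'),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LinearSVC(),
)
clf.fit(docs, labels)
print(clf.predict(["a film with real insight, well worth seeing"]))

The output of the LDA step's transform (the document-topic distribution) becomes the feature vector the SVM trains on, which is exactly the dimensionality reduction described above.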

You can implement supervised LDA with PyMC, using a Metropolis sampler to learn the latent variables in the following graphical model:

[figure: sLDA graphical model]

The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the star rating associated with each document. The star rating is known as the response variable, a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that will best predict the response variables of future unlabeled documents. For more information, check out the original paper. Consider the following code:

import pymc as pm
import numpy as np
import matplotlib.pyplot as plt  # needed for the topic plots below
from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus =  ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3  # star ratings 1-5, centered to the range [-2, 2]

#LDA parameters
num_features = 1000  #vocabulary size
num_topics = 4       #fixed for LDA

tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')

#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)  #size D x V

print "number of docs: %d" %A_tfidf_sp.shape[0]
print "dictionary size: %d" %A_tfidf_sp.shape[1]

#tf-idf dictionary    
tfidf_dict = tfidf.get_feature_names()

K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents

data = A_tfidf_sp.toarray()

#Supervised LDA Graphical Model
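# Variable roles, matching the graphical model above:
#   theta[d] - topic proportions of document d
#   phi[k]   - word distribution of topic k
#   z[d]     - topic assignment of each word slot in document d
#   eta[k]   - regression coefficient of topic k for the response
#   y[d]     - observed response (centered star rating) of document d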
Wd = [len(doc) for doc in data]  # word slots per document (here V, since data is a dense D x V matrix)
alpha = np.ones(K)               # symmetric Dirichlet prior over topic proportions
beta = np.ones(V)                # symmetric Dirichlet prior over topic-word distributions

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])    

z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])

@pm.deterministic
def zbar(z=z):
    zbar_list = []
    for i in range(len(z)):
        # bincount is safer than np.histogram here: the histogram bins span the
        # data's min..max and misalign whenever a topic is unused in a document
        hist = np.bincount(z[i].astype(int), minlength=K)
        zbar_list.append(hist / float(np.sum(hist)))
    return pm.Container(zbar_list)

eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                  value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

#visualize topics: plot the last sample of each topic-word distribution
#(the 2 x 2 grid assumes K = 4 topics)
for k in range(K):
    phi_samples = np.squeeze(mcmc.trace('phi_%i' % k)[:])
    plt.subplot(2, 2, k + 1)
    plt.bar(np.arange(V), phi_samples[-1, :])
plt.show()

Given the training data (observed words and response variables), we can learn the global topics (phi), the regression coefficients (eta), and the topic proportions of each document (theta) for predicting the response variable (y). To make predictions of y given the learned phi and eta, we can define a new model in which y is not observed, and use the previously learned phi and eta to obtain the following result:

[figure: sLDA prediction (posterior histogram of the predicted rating)]

Here we predict a positive review (approximately 2, given the rating range of -2 to +2) for the test corpus consisting of the single sentence "this is a really positive review, great film", as shown by the mode of the posterior histogram on the right. See the ipython notebook for a complete implementation.
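The notebook has the complete prediction code; as a rough sketch of the idea, assuming we plug in posterior means of phi, eta, and tau rather than their full posteriors (the notebook itself may do this differently), something like the following could be appended to the training script:

# point estimates from the training trace (an assumption of this sketch)
phi_hat = np.array([np.squeeze(mcmc.trace('phi_%i' % k)[:]).mean(axis=0) for k in range(K)])
eta_hat = np.array([mcmc.trace('eta_%i' % k)[:].mean() for k in range(K)])
tau_hat = mcmc.trace('tau')[:].mean()

data_test = tfidf.transform(test_corpus).toarray()
W_test = data_test.shape[1]  # word slots, as in training

theta_t = pm.CompletedDirichlet("theta_t", pm.Dirichlet("ptheta_t", theta=alpha))
z_t = pm.Categorical("z_t", p=theta_t, size=W_test,
                     value=np.random.randint(K, size=W_test))

@pm.deterministic
def y_mu_t(z=z_t):
    zbar = np.bincount(z.astype(int), minlength=K) / float(len(z))  # empirical topic frequencies
    return np.dot(eta_hat, zbar)

# y is NOT observed at test time; its trace is the predictive distribution
y_t = pm.Normal("y_t", mu=y_mu_t, tau=tau_hat)

w_t = pm.Container([pm.Categorical("w_t_%i" % i,
                    p=pm.Lambda("phi_zt_%i" % i, lambda z=z_t, i=i: phi_hat[z[i]]),
                    value=data_test[0][i], observed=True) for i in range(W_test)])

mcmc_t = pm.MCMC(pm.Model([theta_t, z_t, y_t, w_t]))
mcmc_t.sample(iter=1000, burn=100, thin=2)

plt.hist(mcmc_t.trace('y_t')[:], bins=20)  # posterior over the predicted rating
plt.show()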

Yes, you can try the Labeled LDA implementation in the Stanford Topic Modeling Toolbox at http://nlp.stanford.edu/software/tmt/tmt-0.4/
