Simple Python implementation of collaborative topic modeling?

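I came across these two papers, which combine collaborative filtering (matrix factorization) and topic modeling (LDA) to recommend articles/posts to users based on the topic terms of the posts/articles those users are interested in.

The papers (in PDF) are: "Collaborative Topic Modeling for Recommending Scientific Articles" and "Collaborative Topic Modeling for Recommending GitHub Repositories".

The new algorithm is called collaborative topic regression. I was hoping to find some Python code that implements it, but to no avail. This may be a long shot, but can someone show a simple Python example?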

This should get you started (although I'm not sure why it hasn't been posted yet): https://github.com/arongdari/python-topic-model

More specifically: https://github.com/arongdari/python-topic-model/blob/master/ptm/collabotm.py

class CollaborativeTopicModel:
    """
    Wang, Chong, and David M. Blei. "Collaborative topic 
                                modeling for recommending scientific articles."
    Proceedings of the 17th ACM SIGKDD international conference on Knowledge
                                discovery and data mining. ACM, 2011.
    Attributes
    ----------
    n_item: int
        number of items
    n_user: int
        number of users
    R: ndarray, shape (n_user, n_item)
        user x item rating matrix
    """

Looks nice and straightforward. I'd still suggest at least taking a look at gensim. Radim has done a fantastic job of optimizing that software.
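To give a feel for what a class like this computes, here is a minimal NumPy sketch of the alternating least-squares updates from the Wang and Blei paper. It is not the code from the linked repository. It assumes the topic proportions `theta` are already available (for example from an LDA run) and are held fixed, and the function name `fit_ctr`, its parameter names, and the toy data are all made up for illustration.

```python
import numpy as np


def fit_ctr(R, theta, lambda_u=0.01, lambda_v=100.0, a=1.0, b=0.01, n_iter=20, seed=0):
    """Sketch of collaborative topic regression with theta held fixed.

    R:     (n_user, n_item) binary matrix, 1 where the user liked the item
    theta: (n_item, K) topic proportions per item (assumed to come from LDA)
    Returns U (n_user, K) and V (n_item, K).
    """
    n_user, n_item = R.shape
    K = theta.shape[1]
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((n_user, K))
    V = theta.copy()                      # item vectors start at their topic proportions
    C = np.where(R > 0, a, b)             # confidence: a for observed likes, b otherwise
    I_K = np.eye(K)

    for _ in range(n_iter):
        # Update each user vector: u_i = (V^T C_i V + lambda_u I)^-1 V^T C_i R_i
        for i in range(n_user):
            A = (V * C[i][:, None]).T @ V + lambda_u * I_K
            rhs = V.T @ (C[i] * R[i])
            U[i] = np.linalg.solve(A, rhs)
        # Update each item vector, pulled toward theta_j by lambda_v:
        # v_j = (U^T C_j U + lambda_v I)^-1 (U^T C_j R_j + lambda_v theta_j)
        for j in range(n_item):
            A = (U * C[:, j][:, None]).T @ U + lambda_v * I_K
            rhs = U.T @ (C[:, j] * R[:, j]) + lambda_v * theta[j]
            V[j] = np.linalg.solve(A, rhs)
    return U, V


# Toy data (hypothetical): items 0-1 are mostly topic 0, items 2-3 mostly topic 1.
theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
R = np.array([[1, 0, 0, 0],    # user 0 liked item 0
              [0, 0, 1, 0]],   # user 1 liked item 2
             dtype=float)

U, V = fit_ctr(R, theta)
scores = U @ V.T               # predicted preference of every user for every item
print(np.round(scores, 2))
```

With a large `lambda_v` the item vectors stay close to their topic proportions, so a user who liked one topic-0 item also gets a higher score for the other topic-0 item, which is the content-based part of the model. The full algorithm in the paper also re-estimates `theta`; that step is left out here.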

Here is a very simple LDA implementation using gensim. You can find more information here: https://radimrehurek.com/gensim/tutorial.html

I hope it helps.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import RSLPStemmer
from gensim import corpora, models
import gensim

# Requires the NLTK data for the tokenizer ('punkt'), 'stopwords' and 'rslp',
# installable with nltk.download(...).
# Note: RSLPStemmer is a Portuguese stemmer; PorterStemmer is the usual choice for English text.
st = RSLPStemmer()
texts = []

doc1 = "Veganism is both the practice of abstaining from the use of animal products, particularly in diet, and an associated philosophy that rejects the commodity status of animals"
doc2 = "A follower of either the diet or the philosophy is known as a vegan."
doc3 = "Distinctions are sometimes made between several categories of veganism."
doc4 = "Dietary vegans refrain from ingesting animal products. This means avoiding not only meat but also egg and dairy products and other animal-derived foodstuffs."
doc5 = "Some dietary vegans choose to wear clothing that includes animal products (for example, leather or wool)."

docs = [doc1, doc2, doc3, doc4, doc5]

# Build the stopword set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))

for doc in docs:
    # lowercase and tokenize, drop stopwords, then stem what is left
    tokens = word_tokenize(doc.lower())
    stopped_tokens = [w for w in tokens if w not in english_stopwords]
    stemmed_tokens = [st.stem(w) for w in stopped_tokens]
    texts.append(stemmed_tokens)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model using gensim  
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, u'0.066*animal + 0.065*, + 0.047*product + 0.028*philosophy'), (1, u'0.085*. + 0.047*product + 0.028*dietary + 0.028*veg')]
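Note that "," and "." show up among the top terms in the output above, because punctuation tokens survive the stopword filter. One way around that, sketched here with only the standard library, is to keep alphabetic tokens only. The helper name `tokenize_words` and the tiny `STOPWORDS` set are made up for this example; in the real script you would keep using NLTK's stopword list, and NLTK's `RegexpTokenizer(r'\w+')` does a similar job.

```python
import re

# Deliberately tiny, hypothetical stopword list, standing in for
# nltk.corpus.stopwords.words('english') so the example has no dependencies.
STOPWORDS = {
    "a", "an", "and", "the", "of", "is", "as", "or", "in", "from", "that",
    "to", "are", "but", "not", "only", "also", "this", "some", "for",
    "both", "either", "between",
}


def tokenize_words(text):
    """Lowercase the text and keep only alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


tokens = tokenize_words("A follower of either the diet or the philosophy is known as a vegan.")
print(tokens)
```

A side effect of this pattern is that hyphenated words such as "animal-derived" are split into two tokens, which is usually fine for topic modeling.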

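You've tagged this with machine-learning and python. Have you looked at the Python pandas and sklearn modules? With those two modules you can quickly create a large number of linear regression objects.

There is also a code example on topic extraction (with Non-negative Matrix Factorization and Latent Dirichlet Allocation) that may fit your exact needs and also help you discover the sklearn module.

Regards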
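For reference, a short sketch of what topic extraction with sklearn can look like. This is not the code from the sklearn example mentioned above, just a small illustration on made-up input: the `docs` list and the helper `top_terms` are assumptions, and with so little text the resulting topics are not meaningful.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical), far too small for meaningful topics
docs = [
    "Dietary vegans refrain from ingesting animal products.",
    "Some vegans avoid clothing that includes animal products.",
    "Matrix factorization is widely used in recommender systems.",
    "Recommender systems suggest articles to users.",
    "Topic models describe articles by their topics.",
]


def top_terms(model, vectorizer, n_top=4):
    """Return the n_top highest-weighted terms of every topic in a fitted model."""
    vocab = vectorizer.vocabulary_
    names = sorted(vocab, key=vocab.get)   # terms ordered by column index
    return [[names[i] for i in comp.argsort()[::-1][:n_top]]
            for comp in model.components_]


# NMF is usually fitted on tf-idf weights
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X_tfidf)             # document x topic weights

# LDA expects raw term counts
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X_counts)   # document x topic proportions

nmf_terms = top_terms(nmf, tfidf)
lda_terms = top_terms(lda, counts)
print(nmf_terms)
print(lda_terms)
```

The rows of `doc_topics` are per-document topic proportions, which is the kind of input a collaborative topic regression model uses as the content side of each item.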
