
Using scikit-learn vectorizers and vocabularies with gensim

I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first of all, I already have a large amount of vectorized data; second, I prefer the interface and flexibility of the scikit-learn vectorizers; third, even though topic modelling with gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.

Similar questions have been asked before, notably here and here, and the bridging solution is gensim's Sparse2Corpus() function, which converts a Scipy sparse matrix into a gensim corpus object.

However, this conversion does not make use of the vocabulary_ attribute of the sklearn vectorizer, which holds the mapping between words and feature ids. That mapping is necessary for printing the discriminant words of each topic (id2word in gensim topic models, described as "a mapping between word ids (integers) and words (strings)").

I am aware that gensim's Dictionary objects are much more elaborate (and slower to compute) than scikit's vect.vocabulary_ (a plain Python dict)...

Any ideas for using vect.vocabulary_ as id2word in gensim models?

Some example code:

# our data
documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']

from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# corpus_vect is a scipy sparse matrix, one row per document
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}

import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# I instead would like something like this line below
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']

Gensim doesn't require Dictionary objects. You can use a plain dict as the input to id2word directly, as long as it maps ids (integers) to words (strings).

In fact, anything dict-like will do (including dict, Dictionary, SqliteDict...).

(By the way, gensim's Dictionary is a plain Python dict underneath. Not sure where your comments on Dictionary performance come from; you cannot get a mapping much faster than a plain dict in Python. Perhaps you are confusing it with text preprocessing (not part of gensim), which can indeed be slow.)

As a final example, the output of a scikit-learn vectorizer can be converted into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping the keys and values:

# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# transform scikit vocabulary into gensim dictionary
vocabulary_gensim = {}
for key, val in vect.vocabulary_.items():
    vocabulary_gensim[val] = key
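
To complete the picture, the swapped dict can then be handed straight to a model as id2word; a minimal sketch, reusing corpus_vect_gensim and vocabulary_gensim from above:

# pass the plain dict as id2word; no gensim Dictionary object is needed
lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vocabulary_gensim, num_topics=2)
print(lsi.print_topics(2))  # topics now print words rather than numeric ids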

I am also running some code experiments with these two libraries. Apparently there is now a way to build a Dictionary directly from a corpus:

from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
                                    id2word=dict((id, word) for word, id in vect.vocabulary_.items()))

You can then use this dictionary for tf-idf, LSI or LDA models.
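
For instance, a sketch of chaining that dictionary into a tf-idf transform and an LSI model (the parameter values here are illustrative):

# tf-idf transform on top of the converted corpus
tfidf = gensim.models.TfidfModel(corpus_vect_gensim, id2word=dictionary)
corpus_tfidf = tfidf[corpus_vect_gensim]

# LSI on the tf-idf-weighted corpus; topics print readable words via the dictionary
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
print(lsi.print_topics(2))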

A solution with working Python 3 code:

import gensim
from gensim.corpora.dictionary import Dictionary
from sklearn.feature_extraction.text import CountVectorizer

def vect2gensim(vectorizer, dtmatrix):
    # transform sparse matrix into gensim corpus and dictionary
    corpus_vect_gensim = gensim.matutils.Sparse2Corpus(dtmatrix, documents_columns=False)
    dictionary = Dictionary.from_corpus(corpus_vect_gensim,
        id2word=dict((id, word) for word, id in vectorizer.vocabulary_.items()))

    return (corpus_vect_gensim, dictionary)

documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']


# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)

# transport to gensim
(gensim_corpus, gensim_dict) = vect2gensim(vect, corpus_vect)
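
A quick usage sketch on top of this helper: verify that the recovered dictionary agrees with the sklearn vocabulary, then train a model:

# every id in the gensim dictionary should map back to the term sklearn assigned it
assert all(gensim_dict[idx] == word for word, idx in vect.vocabulary_.items())

# train an LSI model on the converted corpus
lsi = gensim.models.LsiModel(gensim_corpus, id2word=gensim_dict, num_topics=4)
print(lsi.print_topics(4))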

Posting this as an answer because I don't yet have the 50 reputation needed to comment.

Using vect.vocabulary_ directly (with keys and values swapped) does not work on Python 3, because dict.keys() now returns an iterable view rather than a list. The related error is:

TypeError: can only concatenate list (not "dict_keys") to list

To make this work on Python 3, change line 301 in lsimodel.py to

self.num_terms = 1 + max([-1] + list(self.id2word.keys()))

Hope this helps.
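
If editing the installed package is not appealing, a caller-side workaround is to hand the model a mapping whose keys() returns a list; a sketch (ListKeysDict is a hypothetical helper, and upgrading to a current gensim release also resolves the issue):

# hypothetical wrapper whose keys() returns a list, sidestepping the dict_keys error
class ListKeysDict(dict):
    def keys(self):
        return list(super().keys())

lsi = gensim.models.LsiModel(
    corpus_vect_gensim,
    id2word=ListKeysDict({idx: word for word, idx in vect.vocabulary_.items()}),
    num_topics=2)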

Tutorial example: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html#sphx-glr-auto-examples-core-run-similarity-queries-py

The scikit-learn tokenizer and stop words are the only difference:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import gensim

from gensim import models


print("Text Similarity with Gensim and Scikit utils")
# compute vector space with sklearn
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Using Scikit learn feature extractor

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), stop_words='english')
corpus_vect = vect.fit_transform(documents)
# take the vocabulary terms out of the dict
texts = list(vect.vocabulary_.keys())

from gensim import corpora
# note: the resulting ids line up with sklearn's feature indices because both
# libraries assign vocabulary ids in sorted (alphabetical) order
dictionary = corpora.Dictionary([texts])

# transform scikit vocabulary into gensim dictionary
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# create LSI model
lsi = models.LsiModel(corpus_vect_gensim, id2word=dictionary, num_topics=2)

# convert the query to LSI space
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  
print(vec_lsi)

# Find similarities
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus_vect_gensim])  # transform corpus to LSI space and index it

sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])
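
One design note: MatrixSimilarity keeps the whole index in RAM. For corpora too large for that, gensim's disk-backed similarities.Similarity class is the usual substitute; a sketch, where '/tmp/sim_index' is a hypothetical shard prefix:

# sharded, disk-backed index for corpora that do not fit in memory
index = similarities.Similarity('/tmp/sim_index', lsi[corpus_vect_gensim],
                                num_features=lsi.num_topics)
sims = index[vec_lsi]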
