Using scikit-learn vectorizers and vocabularies with gensim

I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first, I already have a large amount of vectorized data; second, I prefer the interface and flexibility of scikit-learn's vectorizers; third, even though topic modelling with gensim is very fast, computing its dictionary (Dictionary()) is, in my experience, relatively slow.

Similar questions have been asked before, notably here and here, and the bridging solution is gensim's Sparse2Corpus() function, which transforms a Scipy sparse matrix into a gensim corpus object.

However, this conversion does not make use of the vocabulary_ attribute of the sklearn vectorizer, which holds the mapping between words and feature ids. That mapping is necessary for printing the discriminant words of each topic (the id2word parameter in gensim topic models, described as "a mapping from word ids (integers) to words (strings)").

I am aware that gensim's Dictionary objects are much richer (and slower to compute) than scikit's vect.vocabulary_ (a plain Python dict)...

Any ideas on using vect.vocabulary_ as the id2word in gensim models?

Some example code:

# our data
documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']

from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# corpus_vect is a scipy sparse matrix (one row per document)
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}

import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# instead, I would like something like the line below to work
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']

Gensim doesn't require Dictionary objects. You can use a plain dict as the input to id2word directly, as long as it maps ids (integers) to words (strings).

In fact, anything dict-like will do (including dict, Dictionary, SqliteDict, ...).

(By the way, gensim's Dictionary is a plain Python dict underneath. I'm not sure where your comments on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Perhaps you're confusing it with text preprocessing (not part of gensim), which can indeed be slow.)

舉個最后一個例子,scikit-learn 的向量化器對象可以用Sparse2Corpus轉換成dict的語料庫格式,而詞匯表可以通過簡單地交換鍵和值來回收:

# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# transform scikit vocabulary into gensim dictionary
vocabulary_gensim = {}
for key, val in vect.vocabulary_.items():
    vocabulary_gensim[val] = key
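
The inverted mapping can then be passed directly as id2word. A minimal sketch reusing the objects built above (on Python 3 with older gensim releases this needed the fix described in a later answer below):

# train an LSI model, using the recycled vocabulary to label the topics
lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vocabulary_gensim, num_topics=2)
# topics are now described with words rather than feature ids
print(lsi.print_topics(2))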

I have also been doing some code experiments with these two libraries. Apparently there is now a way to construct a dictionary from a corpus:

from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
                                    id2word=dict((id, word) for word, id in vect.vocabulary_.items()))

然后,您可以將此字典用於 tfidf、LSI 或 LDA 模型。

A solution with working Python 3 code:

import gensim
from gensim.corpora.dictionary import Dictionary
from sklearn.feature_extraction.text import CountVectorizer

def vect2gensim(vectorizer, dtmatrix):
    # transform a sparse document-term matrix into a gensim corpus and dictionary
    corpus_vect_gensim = gensim.matutils.Sparse2Corpus(dtmatrix, documents_columns=False)
    dictionary = Dictionary.from_corpus(corpus_vect_gensim,
        id2word=dict((id, word) for word, id in vectorizer.vocabulary_.items()))

    return (corpus_vect_gensim, dictionary)

documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']


# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)

# transport to gensim
(gensim_corpus, gensim_dict) = vect2gensim(vect, corpus_vect)
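
From here the usual gensim workflow applies, for example (a sketch with an arbitrary topic count):

# train and inspect an LSI model on the converted corpus and dictionary
lsi = gensim.models.LsiModel(gensim_corpus, id2word=gensim_dict, num_topics=2)
print(lsi.print_topics(2))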

Posting this as an answer, since I don't yet have the 50 reputation needed to comment.

Using vect.vocabulary_ directly (with keys and values swapped) does not work on Python 3, because dict.keys() now returns an iterable view rather than a list. The associated error is:

TypeError: can only concatenate list (not "dict_keys") to list

To make this work on Python 3, change line 301 in lsimodel.py to

self.num_terms = 1 + max([-1] + list(self.id2word.keys()))

Hope this helps.

Tutorial example: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html#sphx-glr-auto-examples-core-run-similarity-queries-py

The only difference from that tutorial is the use of scikit-learn's tokenizer and stop words.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import gensim

from gensim import models


print("Text Similarity with Gensim and Scikit utils")
# compute vector space with sklearn
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Using Scikit learn feature extractor

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), stop_words='english')
corpus_vect = vect.fit_transform(documents)
# take the vocabulary terms out of the fitted vectorizer
texts = list(vect.vocabulary_.keys())

from gensim import corpora
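# note: Dictionary assigns ids to a single document's tokens in sorted order,
# which happens to match CountVectorizer's alphabetical feature ids,
# so the two mappings line up here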
dictionary = corpora.Dictionary([texts])

# transform scikit vocabulary into gensim dictionary
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)  # transform the sparse matrix into a gensim corpus

# create LSI model
lsi = models.LsiModel(corpus_vect_gensim, id2word=dictionary, num_topics=2)

# convert the query to LSI space
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  
print(vec_lsi)

# Find similarities
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus_vect_gensim])  # transform corpus to LSI space and index it

sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])
