How to predict test data on Gensim Topic modelling

I have used Gensim's LdaMallet for topic modelling. How can I take a sample paragraph and get its topics from the pretrained model?

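import gensim
from gensim import corpora
from gensim.models import CoherenceModel

# Build the bigram model (t_preprocess is the tokenizing helper used on the training data)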
bigram = gensim.models.Phrases(t_preprocess(dataset.data), min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram) 

def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]

data_words_bigrams = make_bigrams(t_preprocess(dataset.data))

# Create Dictionary
id2word = corpora.Dictionary(data_words_bigrams)

# Create Corpus
texts = data_words_bigrams

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# Note: gensim.models.wrappers exists in Gensim 3.x; it was removed in Gensim 4.0
mallet_path = '/home/riteshjain/anaconda3/mallet/mallet2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=12, id2word=id2word, random_seed=0)

coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=texts, dictionary=id2word, coherence='c_v')

a = "When Honda builds a hybrid, you've got to be sure it’s a marvel. And an Accord Hybrid is when technology surpasses the known and takes a leap of faith into tomorrow. This is the next generation Accord, the ninth generation to be precise."

How can I use this text (a) to get its topic from the pretrained model? Please help.

You're going to want to process 'a' the same way you processed the training set:

# Import the new data set to be passed through the pre-trained LDA model
import gensim
import pandas as pd

data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']].copy()
data_text_new['index'] = data_text_new.index
documents_new = data_text_new

# Process the new data set with the same lemmatization and stopword functions
# used for training (lemmatize_stemming is not defined in this answer; it is
# assumed to be the helper from your own training pipeline)
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

processed_docs_new = documents_new['Your Target Column'].map(preprocess)

# Build the bag-of-words corpus with the dictionary the model was trained on
# (id2word in the question), so that token ids match the model's vocabulary
bow_corpus_new = [id2word.doc2bow(doc) for doc in processed_docs_new]
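A point worth checking in this step is which dictionary turns the new tokens into ids. The model only knows the ids assigned by the dictionary it was trained with (`id2word` in the question), so a dictionary built from the new data alone can hand the same token a different id. The sketch below imitates the bag-of-words conversion in plain Python to show the effect. `build_vocab` and `to_bow` are hypothetical helpers written for this illustration, not Gensim functions, and their id assignment is simpler than Gensim's.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(vocab, tokens):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are dropped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy data, invented for this illustration
train_docs = [["honda", "hybrid", "engine"], ["election", "vote", "senate"]]
new_doc = ["hybrid", "accord", "hybrid", "engine"]

train_vocab = build_vocab(train_docs)   # the vocabulary the model was trained on
new_vocab = build_vocab([new_doc])      # a vocabulary rebuilt from the new text only

bow_ok = to_bow(train_vocab, new_doc)     # ids agree with the trained model
bow_wrong = to_bow(new_vocab, new_doc)    # id 0 now means "hybrid", not "honda"

print(bow_ok)
print(bow_wrong)
```

With the training vocabulary, the unseen word "accord" is simply ignored and the remaining ids keep their trained meaning. With the rebuilt vocabulary every token gets an id, but those ids point at different words than the model expects.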

Then you can pass the new corpus through the trained model:

doc_topics = ldamallet[bow_corpus_new]
b = data_text_new

topic_0 = []
topic_1 = []
topic_2 = []

# Each item is a list of (topic_id, weight) pairs for one document
for i in doc_topics:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])

d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}
df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode='a')
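For the single paragraph in the question, the same idea can be wrapped in one small function. This is only a sketch: `dominant_topic` is a hypothetical helper, and it assumes that `tokenize` turns one string into a list of tokens the same way the training data was tokenized, that `phraser` and `dictionary` are the `bigram_mod` and `id2word` objects from training, and that indexing the model with one bag-of-words document returns a list of `(topic_id, weight)` pairs.

```python
def dominant_topic(text, tokenize, phraser, dictionary, model):
    """Return the (topic_id, weight) pair with the highest weight for one text."""
    # Apply the same preprocessing as in training: tokenize, then merge bigrams
    tokens = phraser[tokenize(text)]
    # Convert to bag-of-words using the training dictionary so ids match the model
    bow = dictionary.doc2bow(tokens)
    # Assumed to yield [(topic_id, weight), ...] for a single document
    topic_weights = model[bow]
    return max(topic_weights, key=lambda pair: pair[1])

# Hypothetical usage with the objects from the question
# (my_tokenize stands in for your own single-document tokenizer):
# topic_id, weight = dominant_topic(a, my_tokenize, bigram_mod, id2word, ldamallet)
```

Returning only the top pair keeps the result easy to read. If you need the full distribution, return `topic_weights` instead.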

I hope this helps :)
