Topic modelling with spaCy: not producing good predictions

Question

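I'm working on a topic modelling task: I take people's feedback (free text) and try to extract the important topics from it.

The feedback is quite short, and I don't know whether that is causing the problem. My code is below. Am I missing something obvious?

I'm removing stop words, lemmatizing, and keeping only nouns. But when I pass the result to the model, it doesn't work as well as I had hoped.

A big problem is semantics: customers can refer to the same concept in different ways (shop, boutique, store, supermarket, and so on). These all refer to a shop, but LDA treats them as different concepts and puts them into different topics, even though "I love the shop" and "I love the store" are the same statement.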

import spacy
import pandas as pd
from textblob import TextBlob

#set display options
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_rows', 0)

#ingest data
df = pd.read_csv('surv.csv')

#import spacy language library and stopword dictionary
nlp = spacy.load('en_core_web_sm')
all_stopwords = nlp.Defaults.stop_words

#Limit DF to columns of interest and drop nulls
responses = df[['Comment', 'score']]
responses = responses.dropna()

#lemmatize the strings
def cleanup(row):
    comment = row['Comment']
    comment = nlp(comment)
    sent = []
    for word in comment:
        sent.append(word.lemma_)    
    return " ".join(sent)

#keep only nouns
def only_nouns(row):
    comment = row['nostops']
    blob = TextBlob(comment)
    x = blob.noun_phrases
    return " ".join(x)

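def pos(row):
    comment = row['nostops']
    comment = nlp(comment)
    nouns = []
    # visit every token in the parsed comment, including the last one
    for token in comment:
        if token.pos_ == 'NOUN':
            # keep the lemma as a plain string rather than a spaCy Token object
            nouns.append(token.lemma_)
    # CountVectorizer expects one string per document, so join the nouns
    return " ".join(nouns)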
        
#remove the stop words
def remove_stops(row):
    comment = row['Comment']
    comment = comment.split(' ')  
    rem = []
    for word in comment:
        if word.lower() not in all_stopwords:  # spaCy's stop word list is lowercase
            rem.append(word)
    return " ".join(rem)

#What entities are defined in the document
def split_entities(row):
    comment = row['Comment']
    comment = nlp(comment)
    entities = []
    for ent in comment.ents:
        entities.append(ent)
    return entities          

#Call functions
responses['lemmas'] = responses.apply(cleanup,axis=1)            
responses['nostops'] = responses.apply(remove_stops,axis=1)
responses['nouns'] = responses.apply(pos, axis=1)
responses['nouns2'] = responses.apply(only_nouns, axis=1)
responses['entities'] = responses.apply(split_entities,axis=1)


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english') 
document_term_matrix = cv.fit_transform(responses['nouns'])
lda = LatentDirichletAllocation(n_components=4, random_state=42)
lda.fit(document_term_matrix)
topic_results = lda.transform(document_term_matrix)

Answer 1

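General advice: have you tried adding TF-IDF in sklearn? It is a good way to weight words by how often they occur within a document and across documents, and it can improve the quality of the LDA output. You can add it alongside `CountVectorizer`. The sklearn documentation has a good, complete example of this.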
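As a rough illustration of the TF-IDF suggestion, here is a minimal sketch that puts a `TfidfTransformer` between `CountVectorizer` and `LatentDirichletAllocation`. The `docs` list is a made-up stand-in for the cleaned survey comments, and `n_components=2` is an arbitrary choice for the toy data. This is not the example from the sklearn documentation. Note that LDA is formulated for raw counts, so it is worth comparing the topics you get with and without the reweighting.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-in for the cleaned survey comments
# (in the question this would be a column such as responses['nouns2']).
docs = [
    "store staff friendly",
    "store price high",
    "delivery late package",
    "delivery driver package damaged",
    "staff helpful store clean",
    "package delivery slow",
]

# Step 1: raw term counts, as in the question.
cv = CountVectorizer(stop_words="english")
counts = cv.fit_transform(docs)

# Step 2: reweight the counts so that terms appearing in many documents count less.
tfidf = TfidfTransformer().fit_transform(counts)

# Step 3: fit LDA on the reweighted matrix (sklearn accepts any non-negative matrix).
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(tfidf)

# Recover the vocabulary in column order and list the top terms per topic.
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
top_terms = [
    [terms[i] for i in topic.argsort()[::-1][:3]]
    for topic in lda.components_
]
for k, words in enumerate(top_terms):
    print(f"topic {k}: {', '.join(words)}")
```

To compare against the count-based model, fit a second `LatentDirichletAllocation` on `counts` instead of `tfidf` and inspect the top terms side by side.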

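Specific advice for the words you want treated as synonyms ("shop, boutique, store, supermarket"): I would add a preprocessing step that replaces all of these individual words with exactly the same token (for example, convert every occurrence of "shop", "boutique", and "supermarket" to "store"). It requires creating the synonym list by hand, but it is a simple way to solve the problem.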
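The synonym-replacement step could look something like the sketch below. The `SYNONYMS` table and the `normalize_synonyms` helper are hypothetical names made up for illustration, and the table would need to be filled in by hand for your own data.

```python
import re

# Hypothetical, hand-maintained synonym table: variant -> canonical token.
SYNONYMS = {
    "shop": "store",
    "boutique": "store",
    "supermarket": "store",
}

# One pattern that matches any variant as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SYNONYMS)) + r")\b",
    flags=re.IGNORECASE,
)


def normalize_synonyms(text: str) -> str:
    """Replace every known variant in text with its canonical token."""
    return _PATTERN.sub(lambda m: SYNONYMS[m.group(0).lower()], text)


print(normalize_synonyms("I love the shop"))
print(normalize_synonyms("I love the Boutique near the supermarket"))
```

Because the pattern only matches whole words, plurals such as "shops" are left alone. Running the step on the lemmatized text (for example `responses['lemmas'].apply(normalize_synonyms)`) before vectorizing should take care of that.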

1 2020-11-12 19:52:49
