Python; Using NGram sentiment analysis - cannot get top 5 words

I set up my CountVectorizer as follows:

cv = CountVectorizer(binary=True)
X = cv.fit_transform(train_text)
X_test = cv.transform(test_text)

When I use an SVM I can print out the top 5 words from the sentiment analysis:

final_svm  = LinearSVC(C=best_c)
final_svm.fit(X, target)
final_accuracy = final_svm.predict(X_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print ("Final SVM Accuracy: %s" % final_accuracy_score)
Report_Matricies.accuracy(target_test, final_accuracy)
feature_names = zip(cv.get_feature_names(), final_svm.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:number_we_are_interested_in]

So this works. But when I try similar code for the NGrams I get random words:

ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, no_of_words))
X = ngram_vectorizer.fit_transform(train_text)
X_test = ngram_vectorizer.transform(test_text)
best_c = Logistic_Regression.get_best_hyperparameter(X_train, y_train, y_val, X_val)
final_ngram = LogisticRegression(C=best_c)
final_ngram.fit(X, target)
final_accuracy = final_ngram.predict(X_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print ("Final NGram Accuracy: %s" % final_accuracy_score)
Report_Matricies.accuracy(target_test, final_accuracy)
feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz,
    key=lambda x: x[1],
    reverse=True)

The accuracy ratings between my NGram analysis and the SVM are similar, so the code I'm using for the NGrams doesn't appear to be suitable for extracting the kind of words I want, i.e. they are random words rather than positive words. What code should I use instead? Similar code can be found in this reference, but the example in part 2 doesn't print the top 5 words for the NGrams: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

It looks like you did a little too much copy/pasting when you implemented the logistic regression model trained on ngrams. When you get the feature_names from this model, you are using the binary CountVectorizer cv instead of ngram_vectorizer. I think you need to change the line

feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])

to

feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])

As aberger has already answered, perhaps you should replace:

  • "feature_names = zip(cv.get_feature_names(), final_ngram.coef_[0])" with
  • "feature_names = zip(ngram_vectorizer.get_feature_names(), final_ngram.coef_[0])"

Some additional considerations

In NLP, NGrams means treating N consecutive words as a single token. They are used when "tokenizing" your text corpus so that the corpus can be consumed by a machine algorithm, but they have nothing to do with the algorithm itself.
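For instance, here is a minimal sketch (using one of the pangrams from the script below) of what ngram_range does to the vocabulary: every pair and trio of consecutive words becomes a token alongside the single words.

from sklearn.feature_extraction.text import CountVectorizer

# With ngram_range=(1, 3) the vocabulary holds unigrams, bigrams and trigrams.
cv = CountVectorizer(ngram_range=(1, 3))
cv.fit(["The five boxing wizards jump quickly"])
print(cv.get_feature_names())
# ['boxing', 'boxing wizards', 'boxing wizards jump', 'five', 'five boxing', ...]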

SVM and logistic regression are two different algorithms mostly used for classification (logistic regression is a regression used to separate classes; it is the way we use it that turns this regression into a classification algorithm).
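As an aside, a minimal sketch on toy numbers of what makes the regression usable as a classifier: for a binary problem, LogisticRegression regresses a probability, and predict() simply thresholds that probability at 0.5.

from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression(C=1.0).fit(X, y)
proba = clf.predict_proba([[1.4]])[0, 1]  # regressed probability of class 1
label = clf.predict([[1.4]])[0]           # equivalent to proba > 0.5
print(proba, label)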

I have tried to illustrate this with meaningless data (which you can replace with your own), so that you can run this code directly and observe the result.

As you can see, using NGrams gives almost the same top words, apart from one bigram and one trigram in my own run:

  • Logistic regression without ngrams: [('the', 0.22492305532420143), ('boxing', 0.22366726197682427), ('jump', 0.22366726197682427), ('wizards', 0.22366726197682427), ('five', 0.21116962061694416)]
  • Logistic regression with ngrams: [('the', 0.1549468448457053), ('five', 0.15263348614045338), ('boxing', 0.12657434061922093), ('boxing wizards', 0.12657434061922093), ('boxing wizards jump', 0.12657434061922093)]
  • Logistic regression with ngrams but sorting performed on unigrams only: [('the', 0.1549468448457053), ('five', 0.15263348614045338), ('boxing', 0.12657434061922093), ('jump', 0.12657434061922093), ('wizards', 0.12657434061922093)] <- gives almost the same thing as "logistic regression without NGrams" (not exactly the same, since the model was learned with different tokens, i.e. the extra NGrams here)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

text_train = ["The quick brown fox jumps over a lazy dog",
        "Pack my box with five dozen liquor jugs",
        "How quickly daft jumping zebras vex",
        "The five boxing wizards jump quickly",
        "the fox of my friend it the most lazy one I have seen in the past five years"]

text_test = ["just for a test"]

target_train = [1, 1, 0, 1, 0]

target_test = [1]

#######################################################################
##       OBSERVING TOKENIZATION OF DATA WITH AND WITHOUT NGRAMS      ##
#######################################################################

## WITHOUT NGRAMS

cv = CountVectorizer()
count_vector = cv.fit_transform(text_train)
#Display the dictionary pairing each single word with its position in the
#"vectorized" version of our text corpus, without any count.
print("")
print(cv.vocabulary_)
print("")
print("")
print(dict(zip(cv.get_feature_names(), count_vector.toarray().sum(axis=0))))

##  WITH NGRAMS

#Now let's also add, as meaningful entities, all pairs and all trios of words
#using NGrams
cv = CountVectorizer(ngram_range=(1,3))
count_vector = cv.fit_transform(text_train)
#Observe that now, "jump quickly" and "boxing wizards jump" for instance are
#considered as sort of meaningful unique "words" composed of several unique
#words.
print("")
print("")
print(cv.vocabulary_)
print("")
print("")
#List all tokens and count their occurrences
print(dict(zip(cv.get_feature_names(), count_vector.toarray().sum(axis=0))))

#######################################################################
##                    YOUR ATTEMPT WITH LINEARSVC                    ##
#######################################################################
cv1 = CountVectorizer(binary=True)
count_vector_train = cv1.fit_transform(text_train)
count_vector_test = cv1.transform(text_test)

final_svm  = LinearSVC(C=1.0)
final_svm.fit(count_vector_train, target_train)
final_accuracy = final_svm.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final SVM without NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv1.get_feature_names(), final_svm.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("SVM without NGrams")
print(list_positive)

#######################################################################
##              YOUR ATTEMPT WITH LOGISTIC REGRESSION                ##
#######################################################################
cv2 = CountVectorizer(binary=True)
count_vector_train = cv2.fit_transform(text_train)
count_vector_test = cv2.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression without NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv2.get_feature_names(), final_lr.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression without NGrams")
print(list_positive)

#######################################################################
##         YOUR ATTEMPT WITH LOGISTIC REGRESSION AND NGRAMS          ##
#######################################################################
cv3 = CountVectorizer(binary=True, ngram_range=(1,3))
count_vector_train = cv3.fit_transform(text_train)
count_vector_test = cv3.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression with NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv3.get_feature_names(), final_lr.coef_[0])
feature_to_coef = {
    word: coef for word, coef in feature_names
}
itemz = feature_to_coef.items()
list_positive = sorted(
    itemz, 
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams")
print(list_positive)

#######################################################################
##         YOUR ATTEMPT WITH LOGISTIC REGRESSION AND NGRAMS          ##
##                BUT EXTRACTS ONLY REAL UNIQUE WORDS                ##
#######################################################################
cv4 = CountVectorizer(binary=True, ngram_range=(1,3))
count_vector_train = cv4.fit_transform(text_train)
count_vector_test = cv4.transform(text_test)

final_lr  = LogisticRegression(C=1.0)
final_lr.fit(count_vector_train, target_train)
final_accuracy = final_lr.predict(count_vector_test)
final_accuracy_score = accuracy_score(target_test, final_accuracy)
print("")
print("")
print ("Final Logistic regression with NGrams Accuracy: %s" % final_accuracy_score)
feature_names = zip(cv4.get_feature_names(), final_lr.coef_[0])
feature_names_unigrams = [(a, b) for a, b in feature_names if len(a.split()) < 2]
feature_to_coef = {
    word: coef for word, coef in feature_names_unigrams
}
itemz = feature_to_coef.items()

list_positive = sorted(
    itemz,
    key=lambda x: x[1], 
    reverse=True)[:5] #Here you can choose the top 5
print("")
print("Logistic regression with NGrams but only getting unigrams")
print(list_positive)
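One caveat if you run this on a recent scikit-learn: CountVectorizer.get_feature_names() was deprecated in 1.0 and removed in 1.2, so there you would swap in get_feature_names_out() wherever it appears above, e.g.

feature_names = zip(cv4.get_feature_names_out(), final_lr.coef_[0])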
