TFIDF Vectorizer in SciKit-Learn only returns 5 results

Question

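I am currently working with the TFIDF Vectorizer in SciKit-Learn. The Vectorizer is supposed to apply a formula to detect the most frequent word pairs (bigrams) in a Pandas DataFrame.

However, the code section below only returns a frequency analysis for five bigrams, while the dataset contains thousands of bigrams whose frequencies should be calculated.

Does anyone have a smart idea for getting rid of the error that limits the number of calculations to 5 responses? I have been researching a solution but have not found the right tweak yet.

The relevant code section is shown below: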

def get_top_n_bigram_Group2(corpus, n=None):

    # settings that you use for count vectorizer will go here
    tfidf_vectorizer=TfidfVectorizer(ngram_range=(2, 2), stop_words='english', use_idf=True).fit(corpus)

    # just send in all your docs here
    tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(corpus)

    # get the first vector out (for the first document)
    first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]

    # place tf-idf values in a pandas data frame
    df1 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
    df2 = df1.sort_values(by=["tfidf"],ascending=False)

    return df2

The output code looks like this:

for i in ['txt_pro','txt_con','txt_adviceMgmt','txt_main']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500)
    common_words.to_csv('output.csv')

Answer 1

To achieve what you asked for, and taking your comments into account, the suggested changes are as follows:

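import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_bigram_Group2(corpus, n=None, my_vocabulary=None):

    # settings that you use for count vectorizer will go here
    # (max_features is ignored by CountVectorizer when a vocabulary is supplied)
    count_vectorizer=CountVectorizer(ngram_range=(2, 2),
                                     stop_words='english',
                                     vocabulary=my_vocabulary,
                                     max_features=n)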

    # just send in all your docs here
    count_vectorizer_vectors=count_vectorizer.fit_transform(corpus)

    # Create a list of (bigram, frequency) tuples sorted by their frequency
    sum_bigrams = count_vectorizer_vectors.sum(axis=0) 
    bigram_freq = [(bigram, sum_bigrams[0, idx]) for bigram, idx in count_vectorizer.vocabulary_.items()]
    
    # place bigrams and their frequencies in a pandas data frame
    df1 = pd.DataFrame(bigram_freq, columns=["bigram", "frequency"]).set_index("bigram")
    df1 = df1.sort_values(by=["frequency"],ascending=False)

    return df1

# a list of predefined bigrams
my_vocabulary = ['bigram 1', 'bigram 2', 'bigram 3']
for i in ['text']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500, my_vocabulary)
    common_words.to_csv('output.csv')

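If you do not pass the my_vocabulary argument to get_top_n_bigram_Group2(), CountVectorizer counts all bigrams without any restriction and returns only the top 500 (or whatever number you request in the second argument).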
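The difference between the two modes can be seen on a toy corpus. The following is only a sketch: the corpus and the helper name bigram_counts are invented for illustration, and stop-word removal is left out to keep the counts easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
corpus = [
    "great team great pay",
    "great team long hours",
    "long hours low pay",
]

def bigram_counts(corpus, n=None, vocabulary=None):
    """Return {bigram: total count over the corpus} (hypothetical helper)."""
    vectorizer = CountVectorizer(ngram_range=(2, 2),
                                 vocabulary=vocabulary,
                                 max_features=n)
    matrix = vectorizer.fit_transform(corpus)
    # Sum each column to get one total per bigram.
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    return {bigram: int(totals[idx])
            for bigram, idx in vectorizer.vocabulary_.items()}

all_counts = bigram_counts(corpus)              # every bigram in the corpus
top_two = bigram_counts(corpus, n=2)            # only the 2 most frequent
fixed = bigram_counts(corpus, vocabulary=["great team", "low pay"])

print(all_counts)
print(top_two)
print(fixed)
```

With a fixed vocabulary, only the listed bigrams are counted; without one, max_features keeps the most frequent bigrams across the whole corpus.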

Please let me know if this is what you were looking for. Note that TFIDF does not return frequencies but scores (or "weights", if you prefer).
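To make the distinction between frequencies and scores concrete, here is a small sketch that runs both vectorizers on the same two invented documents. It relies on the default settings of TfidfVectorizer, which L2-normalise each document row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two invented documents, for illustration only.
docs = ["great team great team", "great team low pay"]

# Raw bigram counts: integers, one row per document.
counts = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs).toarray()

# TF-IDF weights: floats, each row scaled to unit length by default.
scores = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs).toarray()

print(counts)
print(scores.round(3))
print(np.linalg.norm(scores, axis=1))  # row norms of the TF-IDF matrix
```

The count matrix answers "how often does this bigram occur", while the TF-IDF matrix holds per-document weights. Note also that the function in the question inspects only the row of the first document, so any bigram absent from that document has a score of zero there, which may be why only a handful of bigrams show non-zero values.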

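TFIDF would make sense if you did not have a predefined list of bigrams, were looking for a way to score all possible bigrams, and wanted to reject those that appear in every document and carry little information (for example, the bigram "it is" appears very often in texts but means very little).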
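The down-weighting of bigrams that occur in every document can be inspected through the fitted idf_ attribute. This is a minimal sketch on an invented corpus; stop-word removal is deliberately left off so that "it is" stays in the vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "it is" occurs in every document, the other bigrams once.
docs = ["it is great pay", "it is long hours", "it is low stress"]

vectorizer = TfidfVectorizer(ngram_range=(2, 2)).fit(docs)

# Order the bigrams by column index so they line up with idf_.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Inverse document frequency per bigram, smallest (least informative) first.
idf = pd.Series(vectorizer.idf_, index=features, name="idf").sort_values()
print(idf)
```

The bigram that appears in every document receives the smallest IDF weight, so its TF-IDF score within a document is lower than that of an equally frequent but rarer bigram.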

Solution 1
1 2020-07-16 10:40:02
