
TFIDF Vectorizer within SciKit-Learn only returning 5 results

I am currently working with the TFIDF Vectorizer within SciKit-Learn. The Vectorizer is supposed to apply a formula to detect the most frequent word pairs (bigrams) within a Pandas DataFrame.

The code section below, however, only returns the frequency analysis for five bigrams, while the dataset includes thousands of bigrams whose frequencies should be calculated.

Does anyone have an idea how to get rid of the error that limits the number of calculations to 5 results? I have been researching a solution but have not found the right tweak yet.

The relevant code section is shown below:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def get_top_n_bigram_Group2(corpus, n=None):

    # settings that you use for count vectorizer will go here
    tfidf_vectorizer=TfidfVectorizer(ngram_range=(2, 2), stop_words='english', use_idf=True).fit(corpus)

    # just send in all your docs here
    tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(corpus)

    # get the first vector out (for the first document)
    first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]

    # place tf-idf values in a pandas data frame
    df1 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
    df2 = df1.sort_values(by=["tfidf"],ascending=False)

    return df2

And the output code looks like this:

for i in ['txt_pro','txt_con','txt_adviceMgmt','txt_main']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500)
    common_words.to_csv('output.csv')

The proposed changes to achieve what you asked for, also taking into account your comments, are as follows:


from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def get_top_n_bigram_Group2(corpus, n=None, my_vocabulary=None):

    # settings that you use for count vectorizer will go here
    count_vectorizer=CountVectorizer(ngram_range=(2, 2), 
                                     stop_words='english', 
                                     vocabulary=my_vocabulary,
                                     max_features=n)

    # just send in all your docs here
    count_vectorizer_vectors=count_vectorizer.fit_transform(corpus)

    # Create a list of (bigram, frequency) tuples sorted by their frequency
    sum_bigrams = count_vectorizer_vectors.sum(axis=0) 
    bigram_freq = [(bigram, sum_bigrams[0, idx]) for bigram, idx in count_vectorizer.vocabulary_.items()]
    
    # place bigrams and their frequencies in a pandas data frame
    df1 = pd.DataFrame(bigram_freq, columns=["bigram", "frequency"]).set_index("bigram")
    df1 = df1.sort_values(by=["frequency"],ascending=False)

    return df1

# a list of predefined bigrams
my_vocabulary = ['bigram 1', 'bigram 2', 'bigram 3']
for i in ['text']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500, my_vocabulary)
    common_words.to_csv('output.csv')

If you do not provide the my_vocabulary argument to get_top_n_bigram_Group2(), then CountVectorizer will count all bigrams without any restriction and will return only the top 500 (or whatever number you request in the second argument).

Please let me know if this is what you were looking for. Note that TFIDF does not return frequencies but rather scores (or 'weights', if you prefer).

I would understand the necessity to use TFIDF if you did not have a predefined list of bigrams and were looking for a way to score all possible bigrams, rejecting those that appear in all documents and carry little information (for example, the bigram "it is" appears very frequently in texts but means very little).
