
TFIDF Vectorizer within SciKit-Learn only returning 5 results

I am currently working with the TfidfVectorizer in scikit-learn. The vectorizer is supposed to score the most frequent word pairs (bigrams) in a column of a Pandas DataFrame.

However, the code below only returns results for five bigrams, while the dataset contains thousands of bigrams whose frequencies should be calculated.

Does anyone have an idea what is limiting the output to 5 results? I have been researching a solution but have not found the right tweak yet.

The relevant code section is shown below:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_n_bigram_Group2(corpus, n=None):

    # settings for the TF-IDF vectorizer go here
    # (note: the n argument is accepted but never used)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 2), stop_words='english', use_idf=True)

    # fit the vectorizer and transform all docs in one step
    tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(corpus)

    # get the first vector out (for the first document only)
    first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]

    # place tf-idf values in a pandas data frame
    df1 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(),
                       index=tfidf_vectorizer.get_feature_names_out(),
                       columns=["tfidf"])
    df2 = df1.sort_values(by=["tfidf"], ascending=False)

    return df2

And the calling code looks like this:

for i in ['txt_pro', 'txt_con', 'txt_adviceMgmt', 'txt_main']:
    # compute the top bigrams for each text column
    common_words = get_top_n_bigram_Group2(df[i], 500)
    common_words.to_csv('output.csv')
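One separate issue worth noting: every pass through this loop writes to the same output.csv, so only the last column's results survive. A minimal sketch that keeps one file per column (the f-string filename is my own addition, not part of the original code):

for i in ['txt_pro', 'txt_con', 'txt_adviceMgmt', 'txt_main']:
    common_words = get_top_n_bigram_Group2(df[i], 500)
    # write each column's results to its own file instead of
    # overwriting output.csv on every iteration
    common_words.to_csv(f'output_{i}.csv')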

The proposed changes to achieve what you asked for, also taking into account your comments, are as follows:


import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_bigram_Group2(corpus, n=None, my_vocabulary=None):

    # settings for the count vectorizer go here;
    # note that max_features is ignored when a vocabulary is supplied
    count_vectorizer = CountVectorizer(ngram_range=(2, 2),
                                       stop_words='english',
                                       vocabulary=my_vocabulary,
                                       max_features=n)

    # just send in all your docs here
    count_vectorizer_vectors = count_vectorizer.fit_transform(corpus)

    # sum the counts of each bigram across all documents
    sum_bigrams = count_vectorizer_vectors.sum(axis=0)
    bigram_freq = [(bigram, sum_bigrams[0, idx])
                   for bigram, idx in count_vectorizer.vocabulary_.items()]

    # place the bigrams and their frequencies in a pandas data frame,
    # sorted by frequency in descending order
    df1 = pd.DataFrame(bigram_freq, columns=["bigram", "frequency"]).set_index("bigram")
    df1 = df1.sort_values(by=["frequency"], ascending=False)

    return df1

# a list of predefined bigrams
my_vocabulary = ['bigram 1', 'bigram 2', 'bigram 3']
for i in ['text']:
    # count only the predefined bigrams in each text column
    common_words = get_top_n_bigram_Group2(df[i], 500, my_vocabulary)
    common_words.to_csv('output.csv')

If you do not provide the my_vocabulary argument to get_top_n_bigram_Group2(), then CountVectorizer will count all bigrams without any restriction and will return only the top n (500, or whatever number you pass as the second argument).
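For instance, a call without a predefined vocabulary (assuming the same df['text'] column as above) would look like this:

# no vocabulary supplied: CountVectorizer builds one from the corpus
# and max_features keeps only the 500 most frequent bigrams
common_words = get_top_n_bigram_Group2(df['text'], 500)
print(common_words.head(20))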

Please let me know if this is what you were looking for. Note that TF-IDF does not return frequencies but rather scores (or, if you prefer, 'weights').

I would understand the need for TF-IDF if you did not have a predefined list of bigrams and were looking for a way to score all possible bigrams while rejecting those that appear in nearly every document and carry little information (for example, the bigram "it is" appears very frequently in English text but means very little).
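In that case, a minimal sketch of corpus-wide TF-IDF scoring might look like the following. The function name get_top_n_bigram_tfidf and the choice to sum each bigram's score over all documents are my own assumptions, not part of your original code; the point is to aggregate over the whole corpus instead of taking only the first document's vector:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_n_bigram_tfidf(corpus, n=None):
    # score bigrams across the whole corpus instead of a single document
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 2), stop_words='english')
    vectors = tfidf_vectorizer.fit_transform(corpus)

    # sum each bigram's tf-idf score over all documents
    # (.A1 flattens the resulting 1 x n_features matrix to a 1-d array)
    scores = vectors.sum(axis=0).A1
    df_scores = pd.DataFrame({'bigram': tfidf_vectorizer.get_feature_names_out(),
                              'tfidf': scores}).set_index('bigram')
    df_scores = df_scores.sort_values(by='tfidf', ascending=False)
    return df_scores if n is None else df_scores.head(n)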
