
Scikit-Learn TfidfVectorizer

I am working on a text classification problem, parsing news stories from RSS feeds, and I suspect many HTML elements and gibberish are being counted as tokens. I know Beautiful Soup provides methods to clean up HTML, but I wanted to try to pass in a dictionary to have more control over what tokens were counted.

This seemed simple enough in concept, but I am getting results I don't understand.

from sklearn.feature_extraction.text import TfidfVectorizer

eng_dictionary = []
with open("C:\\Data\\words_alpha.txt") as f:
    eng_dictionary = f.read().splitlines()

short_dic = []
short_dic.append("short")
short_dic.append("story")

stories = []
stories.append("This is a short story about the color red red red red blue blue blue i am in a car")
stories.append("This is a novel about the color blue red red red red i am in a boot")
stories.append("I like the color green, but prefer blue blue blue blue blue red red red red i am on a bike")

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=short_dic)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=eng_dictionary)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

The output of the program is as follows:

['bike', 'blue', 'boot', 'car', 'color', 'green', 'like', 'novel', 'prefer', 'red', 'short', 'story']
['short', 'story']
ptic', 'skeptical', 'skeptically', 'skepticalness', 'skepticism', 'skepticize', 'skepticized', 'skepticizing'...

The output of the third print goes on and on, so I intentionally cut it short. What is weird, though, is that it starts mid-word, exactly as I show it above. The results of the first two print statements make sense to me:

  • Absence of a vocabulary means features are built directly from the corpus.
  • Providing a vocabulary means features are built from tokens that appear in both the corpus and the vocabulary.

However, the features shown in the third print are not part of my corpus, so why are they showing up?

The vocabulary parameter makes the vectorizer build the TF-IDF matrix with one column per word in the vocabulary, regardless of whether that word ever appears in the corpus. The values are then populated only where a word is actually present.

For example, assuming 'color' is in your 'words_alpha.txt' file:

              skeptical    skeptically    ...        color
stories[2]        0             0         ...    TF-IDF value

That's why they're showing up.

The fact that it is starting mid-word is probably related to your file. You're using splitlines(), so my guess is that your file packs many words onto a line, hits some limit, and breaks to the next line in the middle of the word 'skeptic', and that's where your vocabulary (eng_dictionary) is starting.
