
Scikit-Learn TfidfVectorizer

I am working on a text classification problem, parsing news stories from RSS feeds, and I suspect many HTML elements and gibberish are being counted as tokens. I know Beautiful Soup provides methods to clean up HTML, but I wanted to try to pass in a dictionary to have more control over what tokens were counted.

This seemed simple enough in concept, but I am getting results I don't understand.

from sklearn.feature_extraction.text import TfidfVectorizer

eng_dictionary = []
with open("C:\\Data\\words_alpha.txt") as f:
    eng_dictionary = f.read().splitlines()

short_dic = []
short_dic.append("short")
short_dic.append("story")

stories = []
stories.append("This is a short story about the color red red red red blue blue blue i am in a car")
stories.append("This is a novel about the color blue red red red red i am in a boot")
stories.append("I like the color green, but prefer blue blue blue blue blue red red red red i am on a bike")

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=short_dic)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=eng_dictionary)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

The output of the program is as follows:

['bike', 'blue', 'boot', 'car', 'color', 'green', 'like', 'novel', 'prefer', 'red', 'short', 'story']
['short', 'story']
ptic', 'skeptical', 'skeptically', 'skepticalness', 'skepticism', 'skepticize', 'skepticized', 'skepticizing'...

The output of the third print goes on and on, so I intentionally cut it short. What is weird, though, is that it starts mid-word, exactly as I show it above. The results of the first two print statements make sense to me:

  • Absence of a vocabulary means features are built directly from the corpus.
  • Providing a vocabulary means features are built from tokens that appear in both the corpus and the vocabulary.

However, the features shown in the third print are not part of my corpus, so why are they showing up?

The vocabulary parameter makes the vectorizer build the TF-IDF matrix with one column per word in the vocabulary, regardless of whether that word ever appears in the corpus. The values are then populated only where a word is actually present.

For example, assuming 'color' is in your 'words_alpha.txt' file:

              skeptical    skeptically    ...        color
stories[2]        0             0         ...    TF-IDF value

That's why they're showing up.

The fact that it is starting mid-word is probably related to your file. You're using splitlines(), so my guess is that your file packs many words onto a line, hits some limit, and breaks to the next line in the middle of the word 'skeptic', and that's where your vocabulary (eng_dictionary) is starting.
