I am working on a text classification problem, parsing news stories from RSS feeds, and I suspect many HTML elements and bits of gibberish are being counted as tokens. I know Beautiful Soup provides methods to clean up HTML, but I wanted to try passing in a dictionary so I'd have more control over which tokens are counted.
This seemed simple enough in concept, but I am getting results I don't understand.
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the full English word list, one word per line
with open("C:\\Data\\words_alpha.txt") as f:
    eng_dictionary = f.read().splitlines()

# A tiny two-word vocabulary for comparison
short_dic = ["short", "story"]

stories = []
stories.append("This is a short story about the color red red red red blue blue blue i am in a car")
stories.append("This is a novel about the color blue red red red red i am in a boot")
stories.append("I like the color green, but prefer blue blue blue blue blue red red red red i am on a bike")

# 1) No vocabulary: features are learned from the corpus
vec = TfidfVectorizer(decode_error='ignore', stop_words='english', analyzer='word', lowercase=True)
pos_vector = vec.fit_transform(stories).toarray()
print(vec.get_feature_names())

# 2) Fixed two-word vocabulary
vec = TfidfVectorizer(decode_error='ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=short_dic)
pos_vector = vec.fit_transform(stories).toarray()
print(vec.get_feature_names())

# 3) Vocabulary fixed to the full dictionary file
vec = TfidfVectorizer(decode_error='ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=eng_dictionary)
pos_vector = vec.fit_transform(stories).toarray()
print(vec.get_feature_names())
The output of the program is as follows:
['bike', 'blue', 'boot', 'car', 'color', 'green', 'like', 'novel', 'prefer', 'red', 'short', 'story']
['short', 'story']
ptic', 'skeptical', 'skeptically', 'skepticalness', 'skepticism', 'skepticize', 'skepticized', 'skepticizing'...
The output of the third print goes on and on, so I intentionally cut it short. What is weird, though, is that it starts mid-word, exactly as shown above. The results of the first two print statements make sense to me; however, the features showing in the third print are not part of my corpus. Why are they showing up?
The 'vocabulary' parameter fixes the columns of the TF-IDF matrix to exactly the words in the vocabulary, regardless of what appears in your corpus. A cell is then populated only if that word is actually present in the document.
For example, assuming 'color' is in your 'words_alpha.txt' file:

               skeptical   skeptically   ...   color
    stories[2]     0            0        ...   TF-IDF value
That's why they're showing up.
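You can check this with a minimal sketch (the three-word vocabulary here is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'skeptical' and 'zebra' never occur in the corpus, but they still get
# columns because the vocabulary argument fixes the feature set up front.
vocab = ["color", "skeptical", "zebra"]
vec = TfidfVectorizer(vocabulary=vocab)
matrix = vec.fit_transform(["a story about the color red"]).toarray()

print(sorted(vec.vocabulary_))  # ['color', 'skeptical', 'zebra']
print(matrix[0])                # only the 'color' column is nonzero
```

So the feature names returned in your third run are simply the entries of eng_dictionary, not words extracted from your stories.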
The fact that it is starting mid-word is probably related to your file. You're using splitlines(), so my guess is that your file has a bunch of words, reaches a limit, then goes to the next line in the middle of the word 'skeptic', and that's where your vocabulary (eng_dictionary) starts.
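As a quick sanity check on the loaded word list, it may help to strip blanks and de-duplicate before passing it in (the inline sample below stands in for the real words_alpha.txt, which I don't have; TfidfVectorizer raises a ValueError on a vocabulary with repeated terms):

```python
# Inline sample standing in for the contents of words_alpha.txt
raw = "short\nstory\n\nskeptic\nskeptical\n"

# splitlines() never splits inside a word; drop empty lines and
# de-duplicate while preserving order before handing the list to
# TfidfVectorizer(vocabulary=...).
words = list(dict.fromkeys(w.strip() for w in raw.splitlines() if w.strip()))
print(words)  # ['short', 'story', 'skeptic', 'skeptical']
```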