Scikit-Learn TfidfVectorizer

Question

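I am working on a text classification problem, parsing news stories from RSS feeds, and I suspect that a lot of HTML elements and gibberish are being counted as tokens. I know Beautiful Soup provides ways to clean up HTML, but I wanted to try passing in a dictionary to get more control over which tokens are counted.

This seems simple enough in concept, but I am getting results I don't understand.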

from sklearn.feature_extraction.text import TfidfVectorizer

eng_dictionary = []
with open("C:\\Data\\words_alpha.txt") as f:
    eng_dictionary = f.read().splitlines()

short_dic = []
short_dic.append(("short"))
short_dic.append(("story"))

stories = []
stories.append("This is a short story about the color red red red red blue blue blue i am in a car")
stories.append("This is a novel about the color blue red red red red i am in a boot")
stories.append("I like the color green, but prefer blue blue blue blue blue red red red red i am on a bike")

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=short_dic)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=eng_dictionary)
pos_vector = vec.fit_transform(stories).toarray()

print(vec.get_feature_names())

The output of this program is as follows:

['bike', 'blue', 'boot', 'car', 'color', 'green', 'like', 'novel', 'prefer', 'red', 'short', 'story']
['short', 'story']
ptic', 'skeptical', 'skeptically', 'skepticalness', 'skepticism', 'skepticize', 'skepticized', 'skepticizing'...

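The output of the third print statement goes on and on, so I have deliberately truncated it. The odd thing is that it starts in the middle of a word, exactly as shown above. The results of the first two print statements make sense to me.

The absence of a vocabulary means the features are built directly from the corpus.
Providing a vocabulary means the features are built from tokens that are in both the corpus and the vocabulary.

However, the features shown by the third print are not part of my corpus, so why are they showing up?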

Answer 1

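The 'vocabulary' parameter creates a TF-IDF matrix with one column for every word in the vocabulary, whether or not that word occurs in the corpus. A value is filled in only where the word is present in a document; otherwise the entry stays 0.

For example, assuming "color" is in your "words_alpha.txt" file: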

              skeptical    skeptically    ...    color
stories[2]        0             0         ...    TF-IDF value

That is why they show up.
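
To make this concrete, here is a minimal sketch with a made-up four-word vocabulary (`tiny_vocab` is an assumption for illustration, not your real dictionary). It reads the `vocabulary_` attribute rather than `get_feature_names()`, so it should not depend on which scikit-learn version you have installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

stories = [
    "This is a short story about the color red red red red blue blue blue",
    "This is a novel about the color blue red red red red",
]

# Hypothetical vocabulary: two words occur in the stories, two do not.
tiny_vocab = ["blue", "red", "skeptic", "skeptical"]

vec = TfidfVectorizer(vocabulary=tiny_vocab)
matrix = vec.fit_transform(stories).toarray()

# One column per vocabulary entry, in the order given.
print(vec.vocabulary_)
print(matrix.shape)

# Columns for words missing from the corpus exist but hold only zeros.
unused = [w for w, i in vec.vocabulary_.items() if not matrix[:, i].any()]
print(unused)
```

With the full word list the same thing happens on a larger scale: every entry gets a feature name, and only the handful of columns matching your stories are non-zero.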

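The fact that it starts in the middle of a word is probably related to your file. You are using splitlines(), so my guess is that your file holds a bunch of words, reaches some limit, and then moves on to the next line at the word "skeptic", which is where your vocabulary (eng_dictionary) starts.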
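
One way to test that guess is to look at the loaded list itself rather than at the printed feature names. The sketch below uses a hypothetical helper, `load_vocabulary`, and a small temporary file standing in for words_alpha.txt; both are assumptions for illustration. It also drops blank lines and duplicates, since, as far as I know, TfidfVectorizer rejects a vocabulary list containing repeated terms.

```python
import os
import tempfile


def load_vocabulary(path):
    """Hypothetical helper: one word per line, blanks and duplicates dropped."""
    seen = set()
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip().lower()
            if word and word not in seen:
                seen.add(word)
                words.append(word)
    return words


# Stand-in for words_alpha.txt so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\naa\nskeptic\nskeptical\n\nskeptic\n")
    sample_path = tmp.name

vocab = load_vocabulary(sample_path)
os.remove(sample_path)

# If the first entries are whole words, the list does not start mid-word.
print(len(vocab), vocab[:3], vocab[-1])
```

If the first entries of your real list are whole words, the vocabulary itself is intact and the mid-word start comes from somewhere else in the pipeline or the display.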

Answer 1
1 accepted 2017-08-16 02:05:22
