Custom tokenizer for scikit-learn vectorizers
Given the following list of documents:
docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one'
]
I want to use either of the vectorizer classes in scikit-learn (CountVectorizer or TfidfVectorizer), with 'feature one', 'feature two', 'feature three', and 'feature four' as the four features represented in the matrix.
I tried this:
vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')
But that returns only this:
['feature one`', 'feature two`']
If you have fixed the features to be 'feature one', 'feature two', 'feature three', and 'feature four', then you can also use the vocabulary param.
vocab = ['feature one', 'feature two', 'feature three', 'feature four']
vec = CountVectorizer(vocabulary=vocab)
X = vec.fit_transform(docs)
vec.get_feature_names()
Out:
['feature one', 'feature two', 'feature three', 'feature four']
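One caveat worth spelling out (easy to miss): vocabulary only fixes the column layout, while the analyzer still has to produce tokens that match those entries. Since the entries here are word bigrams, the default unigram analyzer would leave every count at zero; pairing vocabulary with ngram_range=(2, 2) makes the counts come out as expected. A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one'
]

vocab = ['feature one', 'feature two', 'feature three', 'feature four']

# vocabulary fixes which columns exist and in what order; ngram_range=(2, 2)
# makes the analyzer emit word bigrams that can actually match those entries.
vec = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2))
X = vec.fit_transform(docs)

print(X.toarray())
# [[1 1 1 0]
#  [1 1 0 1]
#  [1 0 0 0]]
```

Spurious bigrams such as 'one feature' are still generated by the analyzer, but they are simply ignored because they are not in the fixed vocabulary.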
Alternatively, you can adjust token_pattern so that each token spans the space inside a phrase (using a raw string to avoid invalid escape sequences):

In [295]: vec = CountVectorizer(token_pattern=r'(?u)\w+[\s`]\w+')
In [296]: X = vec.fit_transform(docs)
In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
You may also want to consider using ngram_range=(2,2), which would produce the following:
In [308]: vec = CountVectorizer(ngram_range=(2,2))
In [309]: X = vec.fit_transform(docs)
In [310]: vec.get_feature_names()
Out[310]:
['feature four',
'feature one',
'feature three',
'feature two',
'one feature',
'two feature']