Custom tokenizer for scikit-learn vectorizers
Given the following list of documents:
docs = [
'feature one`feature two`feature three',
'feature one`feature two`feature four',
'feature one'
]
I want to use one of the two scikit-learn vectorizer classes (CountVectorizer or TfidfVectorizer) so that 'feature one', 'feature two', 'feature three' and 'feature four' are the four features represented in the matrix.
I tried this:
vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')
But this only returns the following:
['feature one`', 'feature two`']
If your features are fixed to 'feature one', 'feature two', 'feature three' and 'feature four', you can also use the vocabulary parameter:
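from sklearn.feature_extraction.text import CountVectorizer
vocab = ['feature one', 'feature two', 'feature three', 'feature four']
# vocabulary only fixes the columns and their order. The tokenizer must still
# emit these exact strings, otherwise every count is zero, so pair it with a
# token_pattern that yields two-word tokens.
vec = CountVectorizer(vocabulary=vocab, token_pattern=r'(?u)\w+[\s`]\w+')
X = vec.fit_transform(docs)
vec.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there
Out[310]:
['feature one',
'feature two',
'feature three',
'feature four']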
A token_pattern that matches two words joined by whitespace or a backtick also works:
In [295]: vec = CountVectorizer(token_pattern=r'(?u)\w+[\s`]\w+')
In [296]: X = vec.fit_transform(docs)
In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
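To see what this pattern actually matches, it can help to run it directly with re.findall, which is essentially what the vectorizer does with token_pattern after lowercasing each document. The sketch below uses only the standard library. The second pattern, which takes any run of characters that is not a backtick, is an assumption added for comparison and is not part of the answer above; it presumes the backtick is the only separator in the data.

```python
import re

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]

# Pattern from the answer: two words joined by whitespace or a backtick.
two_word = re.compile(r'(?u)\w+[\s`]\w+')
# Assumed alternative: any run of characters that is not a backtick.
not_backtick = re.compile(r'(?u)[^`]+')

tokens_two_word = [two_word.findall(doc) for doc in docs]
tokens_not_backtick = [not_backtick.findall(doc) for doc in docs]

for a, b in zip(tokens_two_word, tokens_not_backtick):
    print(a, b)
```

Both patterns give the same tokens on these documents. The two-word pattern relies on every feature being exactly two words: for input such as 'red`feature one' it would match 'red`feature', whereas the not-a-backtick pattern would still split on the separator.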
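You may also want to consider ngram_range=(2,2). It produces the four features, plus the bigrams 'one feature' and 'two feature' that span a backtick boundary: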
In [308]: vec = CountVectorizer(ngram_range=(2,2))
In [309]: X = vec.fit_transform(docs)
In [310]: vec.get_feature_names()
Out[310]:
['feature four',
'feature one',
'feature three',
'feature two',
'one feature',
'two feature']
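Since the question asks about a custom tokenizer, another option is to pass a callable through the tokenizer parameter, which avoids both the regular expression and the extra bigrams. This is a minimal sketch, not taken from the answers above: split_on_backtick is a hypothetical helper name, and it assumes the backtick is the only separator. token_pattern=None is set only because the pattern is unused once a tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one',
]


def split_on_backtick(doc):
    """Hypothetical helper: each backtick-delimited chunk becomes one token."""
    return [tok.strip() for tok in doc.split('`') if tok.strip()]


# The same callable works for both vectorizer classes.
count_vec = CountVectorizer(tokenizer=split_on_backtick, token_pattern=None)
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer(tokenizer=split_on_backtick, token_pattern=None)
tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each feature to its column index; sort by index to list columns.
features = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(features)
print(counts.toarray())
```

The columns come out in alphabetical order; combine this with the vocabulary parameter if a specific column order is needed.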