
Custom tokenizer for scikit-learn vectorizers

Given the following list of documents:

docs = [
    'feature one`feature two`feature three',
    'feature one`feature two`feature four',
    'feature one'
]

I want to use either of the vectorizer classes in scikit-learn ( CountVectorizer or TfidfVectorizer ), such that 'feature one' , 'feature two' , 'feature three' , and 'feature four' are the four features represented in the matrix.

I tried this:

vec = CountVectorizer(token_pattern='(?u)\w+\s.\w.`')

But that returns only this:

['feature one`', 'feature two`']

If the set of features is fixed in advance to 'feature one' , 'feature two' , 'feature three' , and 'feature four' , then you can also use the vocabulary parameter.

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['feature one', 'feature two', 'feature three', 'feature four']
vec = CountVectorizer(vocabulary=vocab)

X = vec.fit_transform(docs)
vec.get_feature_names()
# ['feature one', 'feature two', 'feature three', 'feature four']
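
Note that vocabulary only fixes which columns appear in the matrix; with the default word analyzer the two-word vocabulary entries are never produced as tokens, so they would not actually be counted. A minimal sketch of one way to make the counts line up, assuming it is acceptable to combine the fixed vocabulary with ngram_range=(2,2) (my assumption, not something stated above), reusing the docs list from the question:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['feature one', 'feature two', 'feature three', 'feature four']

# Two-word n-grams so the analyzer can emit tokens that match the
# two-word vocabulary entries (assumed combination, see note above).
vec = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2))
X = vec.fit_transform(docs)
X.toarray()
# array([[1, 1, 1, 0],
#        [1, 1, 0, 1],
#        [1, 0, 0, 0]])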
Alternatively, you can fix the token_pattern so that it accepts either whitespace or a backtick between the two words:

In [295]: vec = CountVectorizer(token_pattern='(?u)\w+[\s\`]\w+')

In [296]: X = vec.fit_transform(docs)

In [297]: vec.get_feature_names()
Out[297]: ['feature four', 'feature one', 'feature three', 'feature two']
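
For reference, the resulting document-term matrix (columns in the alphabetical order shown by get_feature_names ) should look like this:

X.toarray()
# array([[0, 1, 1, 1],
#        [1, 1, 0, 1],
#        [0, 1, 0, 0]])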

You may also want to consider using ngram_range=(2,2) , which would produce the following:

In [308]: vec = CountVectorizer(ngram_range=(2,2))

In [309]: X = vec.fit_transform(docs)

In [310]: vec.get_feature_names()
Out[310]:
['feature four',
 'feature one',
 'feature three',
 'feature two',
 'one feature',
 'two feature']
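
Note that ngram_range=(2,2) also produces the spurious cross-boundary bigrams 'one feature' and 'two feature' . If those are unwanted, another option (matching the question title) is to pass a custom tokenizer that splits each document on the backtick. A minimal sketch, assuming the backtick is always the field separator and reusing the docs list from the question; backtick_tokenizer is just an illustrative name:

from sklearn.feature_extraction.text import CountVectorizer

# Treat everything between backticks as a single token.
def backtick_tokenizer(doc):
    return doc.split('`')

vec = CountVectorizer(tokenizer=backtick_tokenizer)
X = vec.fit_transform(docs)
vec.get_feature_names()
# ['feature four', 'feature one', 'feature three', 'feature two']
# (newer scikit-learn versions use get_feature_names_out() instead)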
