TF-IDF vectorizer with Python

Question

I have a problem with TfidfVectorizer in Python. For example, given a string such as 'xxx//xx. aaa.bb.ccc.d', these words are extracted as the keys of the dictionary: 'xxx', 'xx', 'aaa', 'bb', 'ccc', 'd'. Instead, I want these features: 'xxx//xx.', 'aaa.bb.ccc.d'

How can I tell TfidfVectorizer to select words separated only by a space (' ')?

Answer 1

Have a look at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

There is a parameter called token_pattern, which takes the regular expression used to extract tokens.
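
As a minimal sketch of how that parameter might be used for the string in the question: the pattern r"\S+" is an assumption chosen for this example (it is not quoted from the documentation page) and treats every run of non-whitespace characters as one token. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['xxx//xx. aaa.bb.ccc.d']

# \S+ matches each run of non-whitespace characters, so tokens are
# separated only by whitespace (pattern assumed for this example).
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
vectorizer.fit(docs)

# vocabulary_ maps each learned token to its column index.
tokens = sorted(vectorizer.vocabulary_)
print(tokens)  # ['aaa.bb.ccc.d', 'xxx//xx.']
```

Note that TfidfVectorizer lowercases text by default; passing lowercase=False should keep the original case if that matters for your tokens.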

Answer 2

The token_pattern parameter of TfidfVectorizer is used to specify a custom regular expression for splitting text into tokens:

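from sklearn.feature_extraction.text import TfidfVectorizer
a = ['xxx//xx. aaa.bb.ccc.d']
# Note: recent scikit-learn versions may reject a pattern with more than one capturing group
t = TfidfVectorizer(token_pattern=r"([a-z]*//[a-z]*)|([a-z.]*)")
t.fit(a)
print(sorted(t.vocabulary_))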

Output:

[('', ''), ('', '.'), ('', 'aaa.bb.ccc.d'), ('xxx//xx', '')]

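Some post-processing is required in this case: the two capturing groups in the pattern turn each token into a tuple, and the second alternative also matches empty strings.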
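
The tuples and empty strings can be reproduced with Python's re module alone, on the understanding that the vectorizer's default tokenizer applies token_pattern through findall. The sketch below compares the pattern from this answer with r"\S+", which is an alternative assumed here for illustration and not part of the original answer.

```python
import re

text = 'xxx//xx. aaa.bb.ccc.d'

# Two capturing groups: findall returns one tuple per match, and because
# the second alternative can match the empty string, empty matches appear too.
grouped = re.findall(r"([a-z]*//[a-z]*)|([a-z.]*)", text)
print(sorted(set(grouped)))

# No groups and at least one character per match: each token is a plain
# string made of consecutive non-whitespace characters.
plain = re.findall(r"\S+", text)
print(plain)  # ['xxx//xx.', 'aaa.bb.ccc.d']
```

With the second pattern no cleanup step should be needed, since every match is already one of the space-separated features the question asks for.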

solution1
0 2020-05-10 09:54:33

solution2
0 2020-05-10 10:01:49