I have a problem with TfidfVectorizer in Python. For example, given a string like 'xxx//xx. aaa.bb.ccc.d', these words are extracted as the keys of the dictionary: 'xxx', 'xx', 'aaa', 'bb', 'ccc', 'd'. Instead, I want to get these features: 'xxx//xx.', 'aaa.bb.ccc.d'
How can I tell TfidfVectorizer to split words only on spaces (' ')?
Have a look at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
There is a parameter called token_pattern, a regular expression that defines what counts as a token.
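As a minimal sketch of that idea, assuming a token should simply be any maximal run of non-whitespace characters, the pattern r"\S+" could be passed as token_pattern. The variable names here are only illustrative, not taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['xxx//xx. aaa.bb.ccc.d']

# r"\S+" matches every maximal run of non-whitespace characters,
# so punctuation such as '/' and '.' stays inside the token.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
vectorizer.fit(docs)

# The learned vocabulary maps each token to a column index.
features = sorted(vectorizer.vocabulary_)
print(features)  # ['aaa.bb.ccc.d', 'xxx//xx.']
```

Note that the default lowercase=True still applies, so tokens are lowercased before matching.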
The token_pattern parameter of TfidfVectorizer specifies a custom regular expression for what constitutes a token.
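from sklearn.feature_extraction.text import TfidfVectorizer
a = ['xxx//xx. aaa.bb.ccc.d']
# Two capture groups (older scikit-learn only; newer releases raise ValueError)
t = TfidfVectorizer(token_pattern=r"([a-z]*//[a-z]*)|([a-z.]*)")
t.fit(a)
print(sorted(t.vocabulary_))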
Output:
[('', ''), ('', '.'), ('', 'aaa.bb.ccc.d'), ('xxx//xx', '')]
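Some post-cleaning is required in this case: because the pattern has two capture groups, each token is a tuple with an empty string in the slot of the group that did not match, and empty matches show up as ('', ''). Recent scikit-learn versions raise a ValueError for token patterns with more than one capturing group, so there the groups would have to be made non-capturing.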
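The cleaning step could look roughly like the sketch below. It uses Python's re module directly rather than the vectorizer, on the assumption that the tuples come from re.findall being applied with the two-group pattern; the names raw and tokens are hypothetical.

```python
import re

text = 'xxx//xx. aaa.bb.ccc.d'
pattern = re.compile(r"([a-z]*//[a-z]*)|([a-z.]*)")

# With two capture groups, findall returns one tuple per match,
# with an empty string in the slot of the group that did not match.
raw = pattern.findall(text)

# Keep the non-empty group of each match and drop empty matches.
tokens = [first or second for first, second in raw if first or second]
print(tokens)  # ['xxx//xx', '.', 'aaa.bb.ccc.d']
```

A stray '.' token remains because the first alternative does not include the trailing dot, so this pattern still does not give exactly the features asked for in the question.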