TF-IDF vectorizer with Python

I have a problem with TfidfVectorizer in Python. For example, given a string like 'xxx//xx. aaa.bb.ccc.d', these words are extracted as the keys of the dictionary: 'xxx', 'xx', 'aaa', 'bb', 'ccc', 'd'. Instead, I want these features: 'xxx//xx.', 'aaa.bb.ccc.d'

How can I tell TfidfVectorizer to pick out words separated only by a space (' ')?

The token_pattern parameter of TfidfVectorizer lets you specify a custom regular expression for what counts as a token:

from sklearn.feature_extraction.text import TfidfVectorizer

a = ['xxx//xx. aaa.bb.ccc.d']
# Two capture groups: "letters//letters", or a run of letters and dots
t = TfidfVectorizer(token_pattern=r"([a-z]*//[a-z]*)|([a-z.]*)")

Output (the extracted features):

[('', ''), ('', '.'), ('', 'aaa.bb.ccc.d'), ('xxx//xx', '')]
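
To see where the tuples come from: as far as I know, the default tokenizer simply applies `re.findall` with the compiled token_pattern, and `findall` returns one tuple per match whenever the pattern has more than one capture group. A minimal sketch with plain `re`, no scikit-learn involved (the names `doc`, `grouped` and `plain` are only for illustration):

```python
import re

doc = 'xxx//xx. aaa.bb.ccc.d'

# Two capture groups -> findall yields one (group1, group2) tuple per match.
# Both alternatives use '*', so empty matches show up as ('', '').
grouped = re.findall(r"([a-z]*//[a-z]*)|([a-z.]*)", doc)
print(sorted(set(grouped)))
# [('', ''), ('', '.'), ('', 'aaa.bb.ccc.d'), ('xxx//xx', '')]

# No capture groups and '+' instead of '*' -> plain strings, split on whitespace.
plain = re.findall(r"\S+", doc)
print(plain)
# ['xxx//xx.', 'aaa.bb.ccc.d']
```

The second pattern is not part of the answer's code; it is shown only to contrast how `findall` behaves with and without capture groups.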

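Some post-cleaning is required in this case: the pattern has two capture groups, so each token comes back as a tuple with one slot per group, and the empty strings have to be filtered out.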
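
If the goal is just to split on whitespace, a simpler pattern may avoid the cleanup altogether. The sketch below is my own assumption rather than part of the answer above: it passes `token_pattern=r"\S+"` (one or more non-whitespace characters, no capture groups). It may also be the safer route on current installs, since recent scikit-learn releases raise a ValueError when token_pattern contains more than one capture group. `lowercase=False` is added only so the tokens keep their original case; drop it if lowercasing is fine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['xxx//xx. aaa.bb.ccc.d']

# \S+ : every run of non-whitespace characters becomes one token
t = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
t.fit(a)

# vocabulary_ maps each token to its column index; sort the keys for display
tokens = sorted(t.vocabulary_)
print(tokens)  # ['aaa.bb.ccc.d', 'xxx//xx.']
```

This keeps 'xxx//xx.' and 'aaa.bb.ccc.d' as whole features, which is what the question asks for.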
