In a dataframe, I have rows containing comma-separated keyword strings like "machine learning, data, ia, segmentation, analysis" or "big data, data lake, data visualisation, marketing, seo".
I want to use TF-IDF and k-means to create clusters based on each term.
My problem is that TfidfVectorizer tokenizes the sentences incorrectly: I get terms like "analyse analyse" or "english excel" that are not supposed to be put together.
Instead, I would like the sentences to be tokenized on the commas, so the terms would be "analyse", "big data", "data lake", "english", etc.
I guess I should change one of the TfidfVectorizer parameters, but I don't understand which one. Does anyone have an idea how to achieve this?
Use the Keras library to tokenize the sentences in the dataframe. Before tokenization, remove the punctuation from the dataset, then apply TfidfVectorizer.
I am attaching a link; check it.
Check the example code, which will help with tokenizing the sentences.