
Tokenize sentence based on existing punctuation (TF-IDF vectorizer)

In a dataframe, I have rows that contain sentences like "machine learning, data, ia, segmentation, analysis" or "big data, data lake, data visualisation, marketing, seo".

I want to use TF-IDF and k-means to create clusters based on each term.

My problem is that TfidfVectorizer tokenizes the sentences incorrectly: I get terms like "analyse analyse" or "english excel", which are not supposed to be grouped together.

Instead, I would like the sentences to be tokenized on the commas, so the terms would be "analyse", "big data", "data lake", "english", etc.

I guess I should change one of the TfidfVectorizer parameters, but I don't understand which one.

Do you have any idea how to achieve this?

You can use the Keras library to tokenize the sentences in the dataframe. Before tokenizing, remove the punctuation from the dataset, then pass the result to TfidfVectorizer.

I am attaching a link; check it:

Keras

The example code there should help with tokenizing the sentences.
