
Tokenize sentence based on existing punctuation (TF-IDF vectorizer)

In a dataframe, I have rows that contain sentences like "machine learning, data, ia, segmentation, analysis" or "big data, data lake, data visualisation, marketing, seo".

I want to use TF-IDF and k-means to create clusters based on each term.

My problem is that TfidfVectorizer tokenizes the sentences incorrectly: I get terms like "analyse analyse" or "english excel", which are not supposed to be grouped together.

Instead, I would like the sentences to be tokenized on the commas, so the terms would be "analyse", "big data", "data lake", "english", etc.

I guess I should change one of the TfidfVectorizer parameters, but I don't understand which one.

Do you have any idea how to achieve this?

You can use the Keras library to tokenize the sentences in the dataframe. Before tokenizing, remove the punctuation from the dataset, then pass the result to TfidfVectorizer.

I am attaching a link; check it:

Keras

The example code there should help with tokenizing the sentences.
