
Predicting new content for text-clustering using sklearn

I am trying to understand how to cluster texts using sklearn. I have 800 texts (600 for training and 200 for testing) like the following:

Texts   # column name

  1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
  2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
  3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
  4 Outcry after Trump suggests injecting disinfectant as treatment.
  5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
  6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.

and I would like to create clusters from them. To transform the corpus into vector space I have used tf-idf, and to cluster the documents I have used the k-means algorithm. However, I cannot tell whether the results are as expected, since unfortunately the output is not 'graphical' (I have tried to use CountVectorizer to get a frequency matrix, but I am probably using it in the wrong way). What I would expect is that, when I test with the test dataset:

test_dataset = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.", "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus."]

(the test dataset comes from the column df["0"]['Names']), I would like to see which cluster (made by k-means) each text belongs to. Please see below the code that I am currently using:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def preprocessing(line):
    # keep letters only, lowercase, tokenize, drop stopwords, lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
vec = CountVectorizer()

tfidf = tfidf_vectorizer.fit_transform(df["0"]['Names'])
matrix = vec.fit_transform(df["0"]['Names'])

kmeans = KMeans(n_clusters=2).fit(tfidf)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())

where df["0"]['Names'] is the column ' Names ' of the 0th dataframe. A visual example, even with a different dataset but pretty same structure of dataframe (just for a better understanding) would be also good, if you prefer.

Any help you can provide will be greatly appreciated. Thanks

Taking your test_data and adding a few more sentences to make a corpus:

train_data = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
              "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", 
              "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
              "find the most representative document for each topic",
              "topic distribution across documents",
               "to help with understanding the topic",
                "one of the practical application of topic modeling is to determine"]

Creating a dataframe from the above dataset:

 df = pd.DataFrame(train_data, columns=['text'])

Now you can use either CountVectorizer or TfidfVectorizer to vectorize the text; I am using TfidfVectorizer here (a CountVectorizer sketch follows below):

 vect = TfidfVectorizer(tokenizer=preprocessing)

 vectorized_text = vect.fit_transform(df['text'])

 kmeans = KMeans(n_clusters=2).fit(vectorized_text)

 # now predicting the cluster for the given dataset

df['predicted cluster'] = kmeans.predict(vectorized_text)

(screenshot: the resulting dataframe with its 'text' and 'predicted cluster' columns)
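
For completeness, since the question mentions trying CountVectorizer, here is a minimal sketch of the CountVectorizer alternative; it is a drop-in replacement for the tf-idf step above (the names count_vect and kmeans_counts are illustrative, not from the original post):

from sklearn.feature_extraction.text import CountVectorizer

# raw term counts instead of tf-idf weights; reuses the same preprocessing tokenizer
count_vect = CountVectorizer(tokenizer=preprocessing)
counts = count_vect.fit_transform(df['text'])

# cluster the count vectors the same way
kmeans_counts = KMeans(n_clusters=2).fit(counts)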

Now, when you want to predict on test data or new data:

new_sent = 'coronavirus has created lot of problem in the world'
kmeans.predict(vect.transform([new_sent]))  # use transform only here, not fit_transform

# output
array([1])
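
The same transform-then-predict pattern applies to the three test sentences from the question: kmeans.predict(vect.transform(test_dataset)).

Finally, since the question asks for a 'graphical' view of the result, here is a minimal visualization sketch, assuming matplotlib is installed: project the sparse tf-idf vectors down to 2D with TruncatedSVD and color each point by its k-means label.

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# reduce the sparse tf-idf matrix to 2 dimensions for plotting
svd = TruncatedSVD(n_components=2, random_state=42)
points = svd.fit_transform(vectorized_text)

# scatter plot, colored by the cluster assigned to each document
plt.scatter(points[:, 0], points[:, 1], c=kmeans.labels_)
plt.title("k-means clusters (2D TruncatedSVD projection)")
plt.show()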
