
tf-idf for text cluster-analysis

I would like to group the short texts contained in the df['Texts'] column of a dataframe. Examples of the sentences to analyze are as follows:

    Texts

  1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
  2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
  3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
  4 Outcry after Trump suggests injecting disinfectant as treatment.
  5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
  6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.

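Since I know that TF-IDF is useful for clustering, I have been using the following lines of code (following some previous questions in the community):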

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)

kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed

However, since I am working with a column from a dataframe, I do not know how to apply the function above. Could you help me?

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])

kmeans = KMeans(n_clusters=2).fit(tfidf)

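You only need to replace all_text with the column from your dataframe, df['Texts']. It is better still to build a pipeline first, so that the vectorizer and KMeans are applied together.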
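As a minimal sketch of the pipeline idea, assuming scikit-learn and pandas are installed, the vectorizer and KMeans can be chained with `sklearn.pipeline.Pipeline`. The small `df` below is an assumption built from shortened versions of the question's examples, and `n_init=10` and `random_state=0` are illustrative choices, not part of the original code.

```python
import re
import string

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def preprocessing(line):
    # Lowercase the text and replace every punctuation character with a space.
    line = line.lower()
    return re.sub(r"[{}]".format(re.escape(string.punctuation)), " ", line)


# Illustrative data only: shortened versions of the sentences in the question.
df = pd.DataFrame({
    "Texts": [
        "Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.",
        "Only bleach Trump is using is on his heed",
        "Critics said Trump had suggested that Americans drink bleach.",
        "Outcry after Trump suggests injecting disinfectant as treatment.",
        "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
        "The study showed that bleach and isopropyl alcohol killed the virus.",
    ]
})

# Chain TF-IDF vectorization and KMeans so both steps are fitted together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocessing)),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict runs the vectorizer, fits KMeans, and returns one label per row.
df["cluster"] = pipeline.fit_predict(df["Texts"])
print(df[["Texts", "cluster"]])
```

Storing the labels in a new column keeps each text next to its cluster, and the fitted `pipeline` can later be reused on new texts with `pipeline.predict(...)`.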

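In addition, more preprocessing of the text is never a bad idea if you want more precise results. That said, I do not think lowercasing the text is always a good idea: it removes a useful writing-style feature, which matters if you want to identify the author or assign authors to groups. For sentence sentiment, on the other hand, lowercasing is the better choice.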
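To illustrate the effect of lowercasing on the features, here is a small sketch using the `lowercase` parameter of `TfidfVectorizer`. The two toy sentences are assumptions made up for the comparison, not data from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy sentences that differ only in letter case (illustrative data).
texts = ["Trump suggested bleach", "trump SUGGESTED Bleach"]

# Keep the original case: "Trump" and "trump" become separate features.
cased = TfidfVectorizer(lowercase=False).fit(texts)

# Lowercase before tokenizing: case variants collapse into one feature.
lowered = TfidfVectorizer(lowercase=True).fit(texts)

print(sorted(cased.vocabulary_))
print(sorted(lowered.vocabulary_))
```

Note that passing a custom `preprocessor`, as in the code above, overrides the vectorizer's built-in lowercasing step, so in that setup the case handling is decided entirely inside `preprocessing`.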

