TF-IDF for text clustering analysis

Question

I would like to group the short texts contained in the df['Texts'] column of a dataframe. Examples of the sentences to analyse are shown below:

    Texts

  1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
  2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
  3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
  4 Outcry after Trump suggests injecting disinfectant as treatment.
  5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
  6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.

Since I know TF-IDF is useful for clustering, I have been using the following lines of code (adapted from some earlier questions in the community):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)

kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed

However, since my texts are in a dataframe column, I don't know how to apply the code above to it. Could you help me?

Answer 1

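import re
import string

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocessing(line):
    # Lowercase, then replace every punctuation character with a space.
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

# The dataframe column is an iterable of strings, so it can be passed directly.
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])

kmeans = KMeans(n_clusters=2).fit(tfidf)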

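You only need to replace all_text with your dataframe column, df['Texts']. It would be even better to build a pipeline first, so that the vectorizer and KMeans are fitted and applied together.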
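As a rough sketch of the pipeline idea, the two steps can be chained with scikit-learn's Pipeline. The small dataframe, the step names "tfidf" and "kmeans", the new "cluster" column, and the n_init and random_state values are assumptions chosen for illustration, not part of the original question.

```python
import re
import string

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def preprocessing(line):
    # Lowercase, then replace every punctuation character with a space.
    line = line.lower()
    return re.sub(r"[{}]".format(re.escape(string.punctuation)), " ", line)


# Hypothetical stand-in for the asker's dataframe.
df = pd.DataFrame({"Texts": [
    "Donald Trump, Donald Trump news, Trump bleach, bleach coronavirus.",
    "Only bleach Trump is using is on his heed",
    "Trump responded that he was being sarcastic.",
    "Outcry after Trump suggests injecting disinfectant as treatment.",
    "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
    "Bleach and isopropyl alcohol killed the virus in saliva.",
]})

# Vectorizer and clusterer are fitted together as one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocessing)),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict returns one cluster label per row, in row order.
df["cluster"] = pipeline.fit_predict(df["Texts"])
print(df[["Texts", "cluster"]])
```

Writing the labels back into the dataframe keeps each text next to its cluster, and the fitted pipeline can later be reused with pipeline.predict(new_texts) on unseen strings.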

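In addition, more text preprocessing is never a bad idea if you want more precise results. That said, I don't think lowercasing the text is always a good idea: it removes a useful writing-style feature, which matters if your goal is to identify the author or to assign authors to groups. If the goal is to capture the sentiment of the sentences, on the other hand, lowercasing is the better choice.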
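To see what lowercasing changes, one option is to compare the vocabularies produced with and without it. Note that passing a custom preprocessor to TfidfVectorizer overrides its built-in lowercase step, so the case handling has to live inside the preprocessor itself. The helper name strip_punctuation_only and the two sample sentences below are made up for this sketch.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer


def strip_punctuation_only(line):
    # Replace punctuation with spaces but keep the original casing.
    return re.sub(r"[{}]".format(re.escape(string.punctuation)), " ", line)


def strip_and_lower(line):
    # Same cleaning, followed by lowercasing.
    return strip_punctuation_only(line).lower()


# Hypothetical sample texts.
texts = [
    "Trump suggested bleach.",
    "BLEACH kills the virus; bleach works.",
]

cased = TfidfVectorizer(preprocessor=strip_punctuation_only).fit(texts)
lowered = TfidfVectorizer(preprocessor=strip_and_lower).fit(texts)

# With casing preserved, "BLEACH" and "bleach" are separate features.
print(sorted(cased.vocabulary_))
print(sorted(lowered.vocabulary_))
```

Keeping case makes the feature space larger and sparser, which may help for style-based grouping but tends to split words that mean the same thing when clustering by topic or sentiment.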
