
tf-idf for text cluster-analysis

I would like to group the small texts contained in a column, df['Texts'], of a dataframe. Examples of the sentences to analyse are as follows:

    Texts

  1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
  2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
  3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
  4 Outcry after Trump suggests injecting disinfectant as treatment.
  5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
  6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.

Since I know that TF-IDF is useful for clustering, I have been using the following lines of code (following some previous questions in the community):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)

kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed

However, since I am working with a column from a dataframe, I do not know how to apply the above function. Could you help me with it?

def preprocessing(line):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line

# The dataframe column can be passed directly: TfidfVectorizer accepts any iterable of strings
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])

kmeans = KMeans(n_clusters=2).fit(tfidf)
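
For instance (not part of the original answer), you can attach each text's cluster label back to the dataframe to inspect the grouping:

# Assign each text its KMeans cluster label for inspection
df['cluster'] = kmeans.labels_
print(df[['Texts', 'cluster']])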

You just need to replace all_text with your dataframe column. It would also be nice to build a pipeline first, so the vectorizer and KMeans are applied in a single step.
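
A minimal sketch of such a pipeline, assuming the same preprocessing function and the df['Texts'] column from above:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Chain the vectorizer and KMeans so a single fit() runs both steps
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=preprocessing),
    KMeans(n_clusters=2),
)
pipeline.fit(df['Texts'])

# Cluster assignment for each text
labels = pipeline.named_steps['kmeans'].labels_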

Also, for a more precise result, more preprocessing of your text is never a bad idea. However, I don't think lowercasing the text is a good idea, since it naturally removes a useful feature of writing style (if your goal is to identify an author or assign texts to an author group); but if you want the sentiment of the sentences, then yes, it is better to lowercase.
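
As one possible direction (illustrative values, not from the original answer), TfidfVectorizer itself offers some extra preprocessing, such as English stop-word removal and document-frequency filtering:

# Extra preprocessing built into TfidfVectorizer; min_df/max_df values are illustrative
tfidf_vectorizer = TfidfVectorizer(
    preprocessor=preprocessing,
    stop_words='english',  # drop common English function words
    min_df=2,              # ignore terms that appear in fewer than 2 texts
    max_df=0.9,            # ignore terms that appear in more than 90% of texts
)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])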
