tf-idf for text cluster-analysis
I would like to group the short texts contained in one column of a dataframe, df['Texts']. Examples of the sentences to analyse are as follows:
Texts
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
Since I know that TF-IDF is useful for clustering, I have been using the following lines of code (based on some previous questions in the community):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string
def preprocessing(line):
    # Lowercase the text, then replace every punctuation character with a space.
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(all_text)
kmeans = KMeans(n_clusters=2).fit(tfidf) # the number of clusters could be manually changed
However, since I am working with a column from a dataframe, I do not know how to apply the code above. Could you help me?
def preprocessing(line):
    # Lowercase the text, then replace every punctuation character with a space.
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    return line
tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing)
tfidf = tfidf_vectorizer.fit_transform(df['Texts'])
kmeans = KMeans(n_clusters=2).fit(tfidf)
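You just need to replace all_text with your dataframe column, df['Texts']. It would also be better to build a pipeline first, so that the vectorizer and KMeans are applied together.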
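The pipeline idea could look roughly like the sketch below. It assumes scikit-learn's `Pipeline` with `fit_predict`, a tiny illustrative dataframe built from a few of the question's sentences, and a hypothetical `cluster` column to hold the labels. `n_init` and `random_state` are set only to make runs repeatable, and `re.escape` is a small hardening of the original regex, not something the question used.

```python
import re
import string

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def preprocessing(line):
    # Lowercase, then replace every punctuation character with a space.
    line = line.lower()
    return re.sub("[{}]".format(re.escape(string.punctuation)), " ", line)


# Illustrative data only: a few of the sentences from the question.
df = pd.DataFrame({"Texts": [
    "Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.",
    "Outcry after Trump suggests injecting disinfectant as treatment.",
    "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
    "The study also showed that bleach and isopropyl alcohol killed the virus "
    "in saliva or respiratory fluids in a matter of minutes.",
]})

# Vectorizer and clusterer chained, so one call fits both steps.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocessing)),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict returns one cluster label per row of the column.
df["cluster"] = pipeline.fit_predict(df["Texts"])
print(df[["cluster", "Texts"]])
```

Storing the labels back on the dataframe keeps each text next to its cluster, which makes it easy to inspect the groups with something like `df.groupby("cluster")`.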
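Also, to get more precise results, doing more preprocessing on your text is never a bad idea. That said, I do not think lowercasing the text is always a good idea, because it removes a useful feature of writing style (for example, if you want to identify the author or assign authors to groups). If the goal is to capture the sentiment of the sentences, then yes, lowercasing is better.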
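To see what lowercasing changes in practice, here is a minimal sketch that makes it optional. `make_preprocessor` and the two sample sentences are hypothetical, introduced only for illustration. Note that when a custom `preprocessor` is passed, `TfidfVectorizer` skips its own `lowercase` step, so the choice has to live inside the function.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer

# Character class matching any single punctuation character.
PUNCT = re.compile("[{}]".format(re.escape(string.punctuation)))


def make_preprocessor(lowercase=True):
    """Build a preprocessor that strips punctuation and optionally lowercases."""
    def preprocess(line):
        if lowercase:
            line = line.lower()
        return PUNCT.sub(" ", line)
    return preprocess


# Illustrative sentences, not taken from the question's data.
texts = ["Trump suggests bleach.", "BLEACH kills the virus; bleach works."]

cased = TfidfVectorizer(preprocessor=make_preprocessor(lowercase=False)).fit(texts)
uncased = TfidfVectorizer(preprocessor=make_preprocessor(lowercase=True)).fit(texts)

# With case preserved, "BLEACH" and "bleach" are separate features.
print(sorted(cased.vocabulary_))
print(sorted(uncased.vocabulary_))
```

Keeping case gives the clustering more stylistic signal but a larger, sparser vocabulary; lowercasing merges variants of the same word, which usually suits topic or sentiment grouping better.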