How to cluster different texts from different files?
I would like to cluster texts from different files by their topics. I am using the 20 newsgroups dataset, so there are different categories, and I would like to cluster the texts into these categories with DBSCAN. My problem is how to do this.

At the moment I am saving each text of a file in a dict as a string. Then I remove several characters and words and extract nouns from each dict entry. Then I would like to apply tf-idf to each dict entry, which works, but how can I pass this to DBSCAN to cluster the texts into categories?

My text processing and data handling:
import os
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

counter = 0
dic = {}
for i in range(len(categories)):
    path = Path('dataset/20news/%s/' % categories[i])
    print("Getting files from: %s" % path)
    files = os.listdir(path)
    for f in files:
        with open(path/f, 'r', encoding="latin1") as file:
            data = file.read()
            dic[counter] = data
            counter += 1

if preprocess == True:
    print("processing Data...")
    content = preprocessText(data)

if get_nouns == True:
    content = nounExtractor(content)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
So I would like to pass each text to DBSCAN, and I think it would be wrong to put all the texts into one string, because then there would be no way to assign labels to them. Am I right?

I hope my explanation is not too confusing.

Best regards!

EDIT:
all_text = []
for f in files:
    with open(path/f, 'r', encoding="latin1") as file:
        data = file.read()
        all_text.append(data)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
tfidf_vectorizer.fit(all_text)

text_vectors = []
for text in all_text:
    # transform expects an iterable of documents, not a bare string
    text_vectors.append(tfidf_vectorizer.transform([text]))
You should fit the TF-IDF vectorizer on the whole training text corpus, then create a vector representation for each text/document on its own by transforming it with the fitted vectorizer, and finally apply clustering to those vector representations of the documents.
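A minimal sketch of that fit-once, transform-afterwards flow, using a hypothetical toy corpus in place of the 20 newsgroups files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the texts read from the files.
corpus = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock market prices fell sharply today",
    "investors watched the stock market closely",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(corpus)                      # fit ONCE on the whole corpus
doc_vectors = vectorizer.transform(corpus)  # one TF-IDF row per document

print(doc_vectors.shape[0])  # prints 4: one vector per document
```

The important point is that `fit` is called exactly once, so every document is mapped into the same vocabulary/feature space; calling `fit_transform` per document (as in the original loop) gives each text its own incompatible feature space.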
EDIT

A simple edit to your original code would be to replace the following loop
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
with this:
transformed_contents = tfidf_vectorizer.fit_transform(content)
transformed_contents will then contain the vectors that you should run your clustering algorithm against.