How to cluster different texts from different files?
I would like to cluster texts from different files by their topics. I am using the 20 newsgroups dataset, so there are different categories, and I would like to cluster the texts into these categories with DBSCAN. My problem is how to do this.

At the moment I am saving each text of a file in a dict as a string. Then I remove several characters and words and extract nouns from each dict entry. Then I would like to apply tf-idf to each dict entry, which works, but how can I pass this to DBSCAN to cluster the texts into categories?

My text processing and data handling:
import os
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

counter = 0
dic = {}
for i in range(len(categories)):
    path = Path('dataset/20news/%s/' % categories[i])
    print("Getting files from: %s" % path)
    files = os.listdir(path)
    for f in files:
        with open(path/f, 'r', encoding="latin1") as file:
            data = file.read()
            dic[counter] = data
            counter += 1

if preprocess == True:
    print("processing Data...")
    content = preprocessText(data)

if get_nouns == True:
    content = nounExtractor(content)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
So I would like to pass each text to DBSCAN, and I think it would be wrong to put all the texts into one string, because then there would be no way to assign labels to them. Am I right?

I hope my explanation is not too confusing.

Best regards!

EDIT:
all_text = []
for f in files:
    with open(path/f, 'r', encoding="latin1") as file:
        data = file.read()
        all_text.append(data)

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
tfidf_vectorizer.fit(all_text)

text_vectors = []
for text in all_text:
    # transform expects an iterable of documents, not a bare string
    text_vectors.append(tfidf_vectorizer.transform([text]))
You should fit the TF-IDF vectorizer on the whole training text corpus, then create a vector representation for each text/document on its own by transforming it with the fitted vectorizer, and finally apply clustering to those vector representations of the documents.
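A minimal sketch of that fit-once, transform-afterwards flow, using a hypothetical toy corpus in place of the 20 newsgroups files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the texts read from the files.
corpus = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock market prices fell sharply today",
    "investors watched the stock market closely",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(corpus)                      # fit ONCE on the whole corpus
doc_vectors = vectorizer.transform(corpus)  # one TF-IDF row per document

print(doc_vectors.shape[0])  # prints 4: one vector per document
```

The important point is that `fit` is called exactly once, so every document is mapped into the same vocabulary/feature space; calling `fit_transform` per document (as in the original loop) gives each text its own incompatible feature space.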
EDIT

A simple edit to your original code would be to replace the following loop
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])
with this:
transformed_contents = tfidf_vectorizer.fit_transform(content)
transformed_contents will then contain the vectors that you should run your clustering algorithm against.