
How to cluster different texts from different files?

I would like to cluster texts from different files by topic. I am using the 20 Newsgroups dataset, which has several categories, and I would like to cluster the texts into these categories with DBSCAN. How can I do this?

At the moment I save the text of each file as a string in a dict. Then I remove several characters and words and extract the nouns from each dict entry. Next I would like to apply TF-IDF to each dict entry, which works, but how can I pass the result to DBSCAN to cluster it into categories?

My text processing and data handling:

import os
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

counter = 0
dic = {}
# read every file of every category into the dict, one string per file
for i in range(len(categories)):
    path = Path('dataset/20news/%s/' % categories[i])
    print("Getting files from: %s" % path)
    files = os.listdir(path)
    for f in files:
        with open(path/f, 'r', encoding="latin1") as file:
            data = file.read()
            dic[counter] = data
            counter += 1
if preprocess == True:
    print("processing Data...")
    content = preprocessText(data)
if get_nouns == True:
    content = nounExtractor(content)
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
# the vectorizer is re-fitted on every single entry here
for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])

So I would like to pass each text to DBSCAN. I think it would be wrong to put all texts into one string, because then there is no way to assign labels to the individual texts. Am I right?

I hope my explanation is not too confusing.

Best regards!

EDIT:

for f in files:
    with open(path/f, 'r', encoding="latin1") as file:
        data = file.read()
        all_text.append(data)
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=max_features)
tfidf_vectorizer.fit(all_text)
text_vectors = []
for text in all_text:
    # transform() expects an iterable of documents, so wrap the single string in a list
    text_vectors.append(tfidf_vectorizer.transform([text]))

You should fit the TF-IDF vectorizer on the whole training text corpus and then create a vector representation for each text/document on its own by transforming it with the fitted vectorizer. You should then apply clustering to those vector representations of the documents.
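The fit-once, transform-per-document idea can be sketched as follows. This is only a minimal illustration: the toy `docs` list is a made-up stand-in for the newsgroup texts, and `stop_words="english"` is an assumption rather than the asker's actual setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the newsgroup texts (one string per document).
docs = [
    "the rocket launch reached orbit",
    "the shuttle returned from orbit",
    "the goalie saved the hockey puck",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(docs)  # learn the vocabulary and IDF weights from all documents

# transform() expects an iterable of documents, so a single text goes in a list.
single_vector = vectorizer.transform([docs[0]])

# Transforming the whole corpus yields one row per document.
all_vectors = vectorizer.transform(docs)

print(single_vector.shape)
print(all_vectors.shape)
```

Because every document is transformed with the same fitted vocabulary, all vectors have the same number of columns and can be compared with each other.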

EDIT

A simple fix to your original code is to drop the following loop:

for i in range(len(content)):
    content[i] = tfidf_vectorizer.fit_transform(content[i])

You could do this instead:

transformed_contents = tfidf_vectorizer.fit_transform(content)

transformed_contents will then contain one TF-IDF vector (row) per document, and that is what you should run your clustering algorithm against.
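As a rough sketch of that last step, the matrix can be handed to scikit-learn's DBSCAN directly. The toy `content` list, the cosine metric, and the `eps` and `min_samples` values below are illustrative assumptions, not settings tuned for the 20 Newsgroups data.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the preprocessed texts: one string per document.
content = [
    "rocket launch orbit satellite",
    "rocket orbit satellite mission",
    "hockey puck goalie ice",
    "hockey goalie ice rink",
]

tfidf_vectorizer = TfidfVectorizer()
transformed_contents = tfidf_vectorizer.fit_transform(content)

# eps and min_samples are assumed values for this toy data only.
dbscan = DBSCAN(eps=0.5, min_samples=2, metric="cosine")
labels = dbscan.fit_predict(transformed_contents)

# labels[i] belongs to content[i]; a label of -1 marks a document treated as noise.
for text, label in zip(content, labels):
    print(label, text)
```

Since the labels come back in the same order as the input rows, keeping one string per document is what makes it possible to map each cluster label back to its text. Note that DBSCAN decides the number of clusters itself, so the result will not necessarily match the dataset's categories one to one.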
