How to cluster a large text corpus (e.g. list of job titles) using Python or R?

I have a text corpus - a list of job titles extracted from the web. The list is pretty clean and stored as a one-column CSV file where the titles are listed in rows.

I have tried approaches using TF-IDF and Affinity Propagation, but this runs into memory issues. I also tried word2vec followed by a clustering algorithm, but it does not show decent results. What would be the most effective way to cluster a dataset of around 75k job titles?

You can featurize the titles with word-level embeddings such as gensim.models.word2vec and then cluster them with sklearn.cluster.DBSCAN. It's hard to give more concrete advice without seeing the dataset.
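A minimal sketch of that idea, assuming gensim 4.x, a hypothetical titles.csv with a title column, and a simple mean of word vectors as the title representation (DBSCAN's eps/min_samples would need tuning on real data):

import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.cluster import DBSCAN

df = pd.read_csv('titles.csv')                     # hypothetical file/column names
tokens = [t.lower().split() for t in df['title']]

# Train word-level embeddings on the titles themselves
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4)

# Represent each title as the mean of its word vectors (simple, but ignores word order)
def title_vector(words):
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([title_vector(t) for t in tokens])

# Density-based clustering; no need to fix the number of clusters in advance
labels = DBSCAN(eps=0.5, min_samples=5, metric='cosine').fit_predict(X)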

One alternative is topic modeling, e.g. a Latent Dirichlet allocation (LDA) model.

A minimal R example can look like this:

library(topicmodels)
library(tidytext)
library(data.table)
library(tm)

# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]

# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))

# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))

# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')

The good news about the Craigslist dataset is that it has a label (category) for each job title, so you can build a kind of confusion matrix which looks like this:

          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853

Of course, LDA is unsupervised and the estimated topics are not expected to match the original categories, but we can observe semantic intersections, e.g. between the labor category and topic_2.

First you would need to vectorize the text using TF-IDF, word2vec, etc. Please see the TF-IDF implementation below; I am skipping the preprocessing part as it would vary depending on the problem statement.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

df = pd.read_csv('text.csv')
text = df.text.values

# TF-IDF vectorization (preprocessing such as lowercasing/stemming omitted here)
tfidf = TfidfVectorizer(stop_words='english')
vec_fit = tfidf.fit(text)
features = vec_fit.transform(text)

# Now comes the clustering part; you can use KMeans, DBSCAN, etc. at will.
# DBSCAN does not require the number of clusters, but it might take a while
# depending on the size of the corpus.
model = DBSCAN().fit(features)
labels = model.labels_  # cluster label per title; -1 marks noise points

# Note: DBSCAN has no predict() method for unseen data. If you need to assign
# new titles to clusters, use an algorithm such as KMeans instead:
#   unseen_features = vec_fit.transform(unseen_text)
#   y_pred = kmeans_model.predict(unseen_features)

There are evaluation techniques for clustering available in the sklearn documentation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
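As a rough sketch of one of those metrics, the silhouette score (an internal measure that needs no ground-truth labels) could be computed on the DBSCAN result above, excluding the noise points labelled -1; model and features refer to the snippet above:

from sklearn.metrics import silhouette_score

# Exclude noise points (-1) so the score reflects only the actual clusters
mask = model.labels_ != -1
if mask.sum() > 0 and len(set(model.labels_[mask])) > 1:
    score = silhouette_score(features[mask], model.labels_[mask])
    print(f'Silhouette score: {score:.3f}')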
