How to cluster a large text corpus (e.g. list of job titles) using Python or R?
I have a text corpus - a list of job titles extracted from the web. The list is pretty clean and stored as a one-column CSV file where titles are listed in rows.
I have tried approaches using TF-IDF and Affinity Propagation, but this runs into memory issues. I tried to do this using word2vec and then applying a clustering algorithm, but it's not showing decent results. What could be the most effective way to cluster the dataset of around 75k job titles?
You can featurize the titles with word-level embeddings like gensim.models.word2vec and then use sklearn.cluster.DBSCAN. It's hard to give any more concrete advice without seeing the dataset.
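A minimal sketch of that pipeline, assuming gensim 4.x; the toy titles and all parameter values below are illustrative placeholders, not tuned settings:
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import DBSCAN

# Toy corpus standing in for the real 75k job titles
titles = ['senior software engineer', 'junior java developer', 'office manager']
tokenized = [t.lower().split() for t in titles]

# Train word-level embeddings on the titles themselves
# (min_count=1 because job titles are very short documents)
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=3, min_count=1, epochs=50)

# Represent each title as the mean of its word vectors
def title_vector(tokens):
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([title_vector(t) for t in tokenized])

# Cosine distance tends to suit text embeddings better than Euclidean;
# eps and min_samples must be tuned on the real data
labels = DBSCAN(eps=0.3, min_samples=2, metric='cosine').fit_predict(X)
Unlike Affinity Propagation, DBSCAN does not materialize a full N x N similarity matrix, which is likely what caused the memory issues you saw.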
One of the alternatives can be topic modeling, e.g. a Latent Dirichlet Allocation (LDA) model. A minimal R example can look like this:
library(topicmodels)
library(tidytext)
library(data.table)
library(tm)
# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]
# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))
# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))
# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')
The good news about the Craigslist dataset is that it has labels (category) for each job title, so you can build a kind of confusion matrix which looks like this:
          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853
Of course, LDA is unsupervised and the estimated topics aren't guaranteed to match the original categories, but we observe semantic intersections between, e.g., the labor category and topic_2.
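If you prefer to stay in Python, here is a rough equivalent sketch using sklearn's LatentDirichletAllocation; the toy titles and the n_components value are assumptions for illustration:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy titles standing in for the real corpus
titles = ['senior accountant', 'line cook wanted', 'math teacher assistant']

# LDA expects raw term counts rather than TF-IDF weights
counts = CountVectorizer(stop_words='english').fit_transform(titles)

lda = LatentDirichletAllocation(n_components=3, random_state=1234)
doc_topics = lda.fit_transform(counts)   # per-title topic proportions
hard_labels = doc_topics.argmax(axis=1)  # assign each title its dominant topic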
First you would need to vectorize the text using TF-IDF or word2vec etc. Please see a TF-IDF implementation below; I am skipping the preprocessing part as it would vary depending on the problem statement.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

df = pd.read_csv('text.csv')
text = df.text.values

tfidf = TfidfVectorizer(stop_words='english')
vec_fit = tfidf.fit(text)
features = vec_fit.transform(text)

# Now comes the clustering part; you can use KMeans, DBSCAN etc. at your will.
# DBSCAN does not require the number of clusters up front, but it may take a
# while depending on the size of the corpus.
model = DBSCAN().fit(features)
labels = model.labels_  # cluster label per title; -1 marks noise points

# Note: DBSCAN has no predict() method for unseen data. To label new titles,
# refit on the combined corpus, or assign each new point to the cluster of
# its nearest labelled neighbour (e.g. via sklearn.neighbors.NearestNeighbors).
There are evaluation techniques for clustering available in the sklearn docs: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
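For example, without ground-truth labels you can fall back on an internal metric such as the silhouette score; a small sketch, assuming features and model from the snippet above:
from sklearn.metrics import silhouette_score

labels = model.labels_
mask = labels != -1  # exclude DBSCAN noise points from the score
if mask.sum() > 1 and len(set(labels[mask])) > 1:
    # cosine suits TF-IDF vectors better than the default Euclidean metric
    print(silhouette_score(features[mask], labels[mask], metric='cosine'))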