I have a text corpus: a list of job titles extracted from the web. The list is pretty clean and stored as a one-column CSV file, with one title per row.
I have tried TF-IDF with Affinity Propagation, but that runs into memory issues. I also tried word2vec followed by a clustering algorithm, but the results are not decent. What would be the most effective way to cluster a dataset of around 75k job titles?
You can featurize the titles with word-level embeddings like gensim.models.word2vec and then use sklearn.cluster.DBSCAN. It's hard to give more concrete advice without seeing the dataset.
An alternative is topic modeling, e.g. a Latent Dirichlet Allocation (LDA) model. A minimal R example can look like:
library(topicmodels)
library(tidytext)
library(data.table)
library(tm)
# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]
# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))
# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))
# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')
The good news about the Craigslist dataset is that it has a label (category) for each job title, so you can build a kind of confusion matrix, which looks like this:
category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1: accounting 357 113 1091 194 248 241
2: administrative 595 216 1550 260 372 526
3: customerservice 1142 458 331 329 320 567
4: education 296 263 251 280 1638 578
5: foodbeverage 325 369 287 1578 209 431
6: labor 546 1098 276 324 332 853
Of course, LDA is unsupervised and the estimated topics shouldn't be expected to match the original categories, but we observe semantic intersections, e.g. between the labor category and topic_2.
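For readers working in Python rather than R, the same workflow can be sketched with scikit-learn's LatentDirichletAllocation. The toy corpus and the choice of 2 topics below are assumptions for illustration, not part of the original example:

```python
# Sketch of the LDA workflow in Python with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

titles = ["accounting clerk", "senior accountant",
          "line cook", "restaurant cook wanted"]

# LDA expects raw term counts, not TF-IDF weights
counts = CountVectorizer(stop_words='english').fit_transform(titles)

lda = LatentDirichletAllocation(n_components=2, random_state=1234)
doc_topics = lda.fit_transform(counts)  # each row is a per-title topic mixture

# Assign each title to its most probable topic
topics = doc_topics.argmax(axis=1)
print(topics)
```

As in the R example, the per-document topic mixture can then be cross-tabulated against any available labels to inspect relevance.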
First you need to vectorize the text using TF-IDF, word2vec, etc. A TF-IDF implementation is shown below; I am skipping the preprocessing step, as it varies depending on the problem.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
df = pd.read_csv('text.csv')
text = df.text.values
tfidf = TfidfVectorizer(stop_words='english')
vec_fit = tfidf.fit(text)
features = vec_fit.transform(text)
# now comes the clustering part; you can use KMeans, DBSCAN, etc. at will
model = DBSCAN().fit(features)  # can take a while on a large corpus, but does
                                # not require the number of clusters up front
labels = model.labels_          # one cluster id per title; -1 marks noise
# Note: DBSCAN does not implement predict() for unseen data. To assign new
# titles to clusters, refit on the combined corpus, or train a classifier
# (e.g. KNeighborsClassifier) on the non-noise points:
# unseen_features = vec_fit.transform(unseen_text)
There are evaluation techniques for clustering, described in the sklearn docs: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
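As a small illustration of two of those metrics, here is a sketch using silhouette score (needs no ground truth) and adjusted Rand index (usable when labels exist, as with the Craigslist categories). The synthetic blobs stand in for real featurized titles:

```python
# Sketch: evaluating a clustering with and without ground-truth labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Well-separated synthetic clusters as a stand-in for vectorized titles
X, y_true = make_blobs(n_samples=200,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # closer to 1 = tighter, separated clusters
print(adjusted_rand_score(y_true, labels))  # 1.0 = perfect agreement with labels
```

Silhouette score is the one to reach for on the 75k-title corpus, since no ground-truth categories are available there.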