How to cluster a large text corpus (e.g. list of job titles) using Python or R?

I have a text corpus: a list of job titles extracted from the web. The list is fairly clean and stored as a one-column CSV file, with one title per row.

I have tried TF-IDF followed by Affinity Propagation, but this runs into memory issues. I also tried word2vec followed by a clustering algorithm, but it does not give decent results. What would be the most effective way to cluster a dataset of around 75k job titles?

You can featurize the titles with word-level embeddings such as gensim.models.word2vec and then cluster them with sklearn.cluster.DBSCAN. It's hard to give more concrete advice without seeing the dataset.
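
A minimal Python sketch of that pipeline, assuming gensim 4.x and representing each title as the average of its word vectors, could look like the following (the titles list and the DBSCAN parameters are placeholders you would tune on your own data):

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import DBSCAN

# titles: your list of job-title strings, e.g. read from the one-column CSV
titles = ["senior software engineer", "registered nurse", "data analyst"]
tokenized = [t.lower().split() for t in titles]

# Train word-level embeddings on the corpus (hyperparameters are illustrative)
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=3, min_count=1, seed=1)

# Represent each title as the mean of its word vectors
def title_vector(tokens):
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([title_vector(t) for t in tokenized])

# Cluster the title vectors; eps and min_samples need tuning for your data
labels = DBSCAN(eps=0.5, min_samples=5, metric='cosine').fit_predict(X)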

An alternative is topic modeling, e.g. a Latent Dirichlet Allocation (LDA) model.

A minimal R example could look like this:

library(topicmodels)
library(tidytext)
library(data.table)
library(tm)

# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]

# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))

# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))

# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')

The good news about the Craigslist dataset is that it has a label (category) for each job title, so you can build a kind of confusion matrix, which looks like this:

          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853

Of course, LDA is unsupervised, so the estimated topics are not expected to match the original categories exactly, but we can observe semantic overlap, e.g. between the labor category and topic_2.

First, you need to vectorize the text using TF-IDF, word2vec, etc. See a TF-IDF implementation below; I am skipping the preprocessing step, since it varies with the problem statement.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

df = pd.read_csv('text.csv')
text = df.text.values

# Vectorize the titles with TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
features = tfidf.fit_transform(text)

# Now comes the clustering part; you can use KMeans, DBSCAN, etc. at will.
# DBSCAN may take a while depending on the size of the corpus, but it does
# not require you to specify the number of clusters up front.
model = DBSCAN().fit(features)

# Note that DBSCAN has no predict() method: cluster labels for the fitted
# data are in model.labels_, with noise points labelled -1.
labels = model.labels_

There are evaluation techniques for clustering available in the sklearn documentation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
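
For instance, a quick sanity check on the DBSCAN labels from the snippet above (a sketch only; true_labels is a placeholder for ground-truth categories, if you have any) could use the silhouette score and the adjusted Rand index:

from sklearn.metrics import silhouette_score, adjusted_rand_score

# Internal metric: silhouette score, ignoring DBSCAN noise points (label -1)
mask = labels != -1
if mask.sum() > 1 and len(set(labels[mask])) > 1:
    print(silhouette_score(features[mask], labels[mask]))

# External metric: compare against ground-truth labels if they exist
# print(adjusted_rand_score(true_labels, labels))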
