How to cluster a large text corpus (e.g. list of job titles) using Python or R?

I have a text corpus - a list of job titles extracted from the web. The list is pretty clean and stored as a one-column CSV file where the titles are listed in rows.

I have tried approaches using TF-IDF and Affinity Propagation, but this runs into memory issues. I also tried word2vec followed by a clustering algorithm, but it does not show decent results. What would be the most effective way to cluster a dataset of around 75k job titles?

You can featurize the titles with word-level embeddings such as gensim.models.word2vec and then cluster them with sklearn.cluster.DBSCAN. It's hard to give more concrete advice without seeing the dataset.
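A minimal sketch of that idea, assuming gensim 4.x, a hypothetical titles.csv with a title column, and a simple mean of word vectors as the title representation (DBSCAN's eps/min_samples would need tuning on real data):

import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.cluster import DBSCAN

df = pd.read_csv('titles.csv')                     # hypothetical file/column names
tokens = [t.lower().split() for t in df['title']]

# Train word-level embeddings on the titles themselves
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4)

# Represent each title as the mean of its word vectors (simple, but ignores word order)
def title_vector(words):
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([title_vector(t) for t in tokens])

# Density-based clustering; no need to fix the number of clusters in advance
labels = DBSCAN(eps=0.5, min_samples=5, metric='cosine').fit_predict(X)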

One alternative is topic modeling, e.g. a Latent Dirichlet allocation (LDA) model.

A minimal R example can look like this:

library(topicmodels)
library(tidytext)
library(data.table)
library(tm)

# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]

# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))

# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))

# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')

The good news about the Craigslist dataset is that it has a label (category) for each job title, so you can build a kind of confusion matrix which looks like this:

          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853

Of course, LDA is unsupervised and the estimated topics are not expected to match the original categories, but we can observe semantic intersections, e.g. between the labor category and topic_2.

First you would need to vectorize the text using TF-IDF, word2vec, etc. Please see the TF-IDF implementation below; I am skipping the preprocessing part as it would vary depending on the problem statement.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

df = pd.read_csv('text.csv')
text = df.text.values

# TF-IDF vectorization (preprocessing such as lowercasing/stemming omitted here)
tfidf = TfidfVectorizer(stop_words='english')
vec_fit = tfidf.fit(text)
features = vec_fit.transform(text)

# Now comes the clustering part; you can use KMeans, DBSCAN, etc. at will.
# DBSCAN does not require the number of clusters, but it might take a while
# depending on the size of the corpus.
model = DBSCAN().fit(features)
labels = model.labels_  # cluster label per title; -1 marks noise points

# Note: DBSCAN has no predict() method for unseen data. If you need to assign
# new titles to clusters, use an algorithm such as KMeans instead:
#   unseen_features = vec_fit.transform(unseen_text)
#   y_pred = kmeans_model.predict(unseen_features)

There are evaluation techniques for clustering available in the sklearn documentation: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
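As a rough sketch of one of those metrics, the silhouette score (an internal measure that needs no ground-truth labels) could be computed on the DBSCAN result above, excluding the noise points labelled -1; model and features refer to the snippet above:

from sklearn.metrics import silhouette_score

# Exclude noise points (-1) so the score reflects only the actual clusters
mask = model.labels_ != -1
if mask.sum() > 0 and len(set(model.labels_[mask])) > 1:
    score = silhouette_score(features[mask], model.labels_[mask])
    print(f'Silhouette score: {score:.3f}')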
