How to cluster a large text corpus (e.g. list of job titles) using Python or R?
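I have a text corpus: a list of job titles extracted from the web. The list is fairly clean and is stored as a one-column CSV file, with one title per row.
I tried an approach using TF-IDF and Affinity Propagation, but it runs into memory issues. I also tried word2vec
followed by a clustering algorithm, but that did not give decent results. What is the most effective way to cluster a dataset of about 75,000 job titles?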
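You can featurize the titles with word-level embeddings such as gensim.models.word2vec
and then cluster the resulting vectors with sklearn.cluster.DBSCAN.
It is hard to give more specific advice without seeing the dataset.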
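A minimal sketch of that idea, in case it helps: each title is represented by the average of its word vectors, and the averaged vectors are passed to DBSCAN. The `title_vectors` helper and the toy 2-d vectors are made up for illustration; in practice `lookup` would be a trained embedding such as the `.wv` attribute of a gensim Word2Vec model, and `eps` and `min_samples` are guesses that need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def title_vectors(tokenized_titles, lookup, dim):
    """Average the word vectors of each title; unknown words are skipped."""
    rows = []
    for tokens in tokenized_titles:
        known = [lookup[t] for t in tokens if t in lookup]
        # A title with no known words falls back to the zero vector.
        rows.append(np.mean(known, axis=0) if known else np.zeros(dim))
    return np.vstack(rows)


# Toy 2-d vectors standing in for trained embeddings (hypothetical values).
toy_lookup = {
    "senior": np.array([0.9, 0.1]),
    "software": np.array([1.0, 0.0]),
    "engineer": np.array([0.95, 0.05]),
    "developer": np.array([0.9, 0.0]),
    "line": np.array([0.0, 1.0]),
    "cook": np.array([0.05, 0.95]),
    "chef": np.array([0.0, 0.9]),
}

titles = ["senior software engineer", "software developer", "line cook", "chef"]
X = title_vectors([t.split() for t in titles], toy_lookup, dim=2)

# Cosine distance compares direction only; eps and min_samples are assumptions.
labels = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit_predict(X)
print(dict(zip(titles, labels)))
```

Averaging is only one way to pool word vectors into a title vector; it is used here because job titles are short and it needs no extra model.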
One alternative is topic modeling, for example a Latent Dirichlet Allocation (LDA) model.
A minimal R
example looks like this:
library(topicmodels)
library(tidytext)
library(data.table)
library(tm)
# Reading Craigslist job titles
jobs <- fread('https://raw.githubusercontent.com/h2oai/app-ask-craig/master/workflow/data/craigslistJobTitles.csv')
jobs[, doc_id := 1:.N]
# Building a text corpus
dtm <- DocumentTermMatrix(Corpus(DataframeSource(jobs[, .(doc_id, text = jobtitle)])),
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE,
                                         wordLengths = c(1, Inf)))
# Let's set number of topics to be equal to number of categories and fit LDA model
n_topics <- length(unique(jobs[, category]))
lda <- LDA(dtm, k = n_topics, method = 'Gibbs', control = list(seed = 1234, iter = 1e4))
# Kind of confusion matrix to inspect relevance
docs <- setDT(tidy(lda, matrix = 'gamma'))[, document := as.numeric(document)]
docs <- docs[, .(topic = paste0('topic_', .SD[gamma == max(gamma)]$topic)), by = .(doc_id = document)]
dcast(merge(jobs, docs)[, .N, by = .(category, topic)], category ~ topic, value.var = 'N')
The good thing about the Craigslist dataset is that it has a label (category) for each job title, so you can build a kind of confusion matrix like this:
          category topic_1 topic_2 topic_3 topic_4 topic_5 topic_6
1:      accounting     357     113    1091     194     248     241
2:  administrative     595     216    1550     260     372     526
3: customerservice    1142     458     331     329     320     567
4:       education     296     263     251     280    1638     578
5:    foodbeverage     325     369     287    1578     209     431
6:           labor     546    1098     276     324     332     853
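Of course, LDA is unsupervised, so the estimated topics are not expected to match the original categories, but we do observe a semantic overlap between, for example, the labor
category and topic_2.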
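For readers on the Python side, a rough analogue of the R workflow might look like the sketch below. It is not a port of the code above: scikit-learn's LatentDirichletAllocation fits the model with variational Bayes rather than Gibbs sampling, no stemming is applied, and the six titles are hypothetical stand-ins for the Craigslist file. The column names `jobtitle` and `category` are borrowed from the R example.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the labelled job titles.
jobs = pd.DataFrame({
    "jobtitle": [
        "staff accountant", "senior accountant payroll",
        "line cook", "prep cook and dishwasher",
        "warehouse laborer", "general laborer construction",
    ],
    "category": ["accounting", "accounting", "foodbeverage",
                 "foodbeverage", "labor", "labor"],
})

# Bag-of-words counts; LDA works on raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(jobs["jobtitle"])

# One topic per known category, mirroring the R example.
n_topics = jobs["category"].nunique()
lda = LatentDirichletAllocation(n_components=n_topics, random_state=1234)
doc_topic = lda.fit_transform(counts)  # one row per title, rows sum to 1

# Most probable topic per title, then category-by-topic counts.
jobs["topic"] = ["topic_%d" % (i + 1) for i in doc_topic.argmax(axis=1)]
table = pd.crosstab(jobs["category"], jobs["topic"])
print(table)
```

With only six titles the topics are not meaningful; the point is the shape of the pipeline, which stays the same on the full dataset.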
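First, you need to vectorize the text using TF-IDF, word2vec, or similar. See the TF-IDF implementation below; I am skipping the preprocessing part, since it varies with the problem statement.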
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
df = pd.read_csv('text.csv')
text = df.text.values
tfidf = TfidfVectorizer(stop_words='english')
vec_fit = tfidf.fit(text)
features = vec_fit.transform(text)
# now comes the clustering part; you can use KMeans or DBSCAN as you prefer
model = DBSCAN().fit(features)  # may be slow on a large corpus; the number of clusters is not required
labels = model.labels_  # one cluster label per document, -1 marks noise
# DBSCAN has no predict() method; to label unseen text, fit an estimator that has one,
# e.g. sklearn.cluster.KMeans, and call predict() on vec_fit.transform(unseen_text)
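scikit-learn's DBSCAN does not expose a predict() method, so labelling unseen titles needs a different estimator. One option, sketched here under assumptions, is MiniBatchKMeans: it accepts the sparse TF-IDF matrix directly, processes the data in small batches, and can assign new text to the nearest centroid. The titles and the choice of two clusters are hypothetical; unlike DBSCAN, the number of clusters has to be chosen up front.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles; in practice these come from the CSV column.
titles = [
    "software engineer", "senior software engineer", "java developer",
    "line cook", "sous chef", "pastry cook",
]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(titles)  # sparse matrix, one row per title

# n_clusters=2 is an assumption made for this toy data.
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(dict(zip(titles, km.labels_)))

# Unseen text must go through the same fitted vectorizer before predict().
unseen = ["java software engineer"]
pred = km.predict(tfidf.transform(unseen))
print(pred)
```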
The scikit-learn documentation describes evaluation techniques for clustering: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
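As one example from that page, the silhouette score needs no ground-truth labels, so it can be used to compare candidate cluster counts. The sketch below assumes KMeans on TF-IDF features with hypothetical titles; the candidate values of k are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical titles; in practice these come from the CSV column.
titles = [
    "software engineer", "senior software engineer", "java developer",
    "line cook", "sous chef", "pastry cook",
]
features = TfidfVectorizer().fit_transform(titles)

# Score each candidate k; values closer to 1 mean tighter, better separated clusters.
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels, metric="cosine")

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On something the size of 75,000 titles the full pairwise computation is expensive; the sample_size argument of silhouette_score scores a random subset instead.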