
What is the optimal topic-modelling workflow with MALLET?

Introduction

I'd like to know what other topic modellers consider to be an optimal topic-modelling workflow, all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested in learning about best practices for the end-to-end process.

Proposed Solution Specifications

I'd like the proposed solution to rely preferably on R for text processing (but Python is fine also), and for the topic modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R; however, I would like to switch to MALLET as it offers many benefits over topicmodels: it can handle a lot of data, it does not rely on specific text pre-processing tools, and it appears to be widely used for this purpose. Note that some of the issues outlined below are also relevant for topicmodels. I'd like to know how others approach topic modelling and which of the steps below could be improved. Any useful piece of advice is welcome.

Outline

Here is how this is going to work: I will go through the workflow which, in my opinion, works reasonably well, and I will outline the problems at each step.

Proposed Workflow

1. Clean text

This involves removing punctuation marks, digits and stop words, stemming words, and other text-processing tasks. Many of these can be done as part of term-document matrix construction, through functions such as TermDocumentMatrix from R's tm package.

Problem: These steps may instead need to be performed on the text strings directly, using functions such as gsub, in order for MALLET to consume the strings. Performing them on the strings directly is not as efficient, as it involves repetition (e.g. the same word would have to be stemmed several times). A sketch of what I mean is below.
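For illustration, here is a minimal sketch of that string-level cleaning in base R, assuming texts is a character vector with one raw document per element; note how the stemming step re-stems every occurrence of a word, which is the inefficiency mentioned above.

```r
# Minimal sketch of string-level cleaning, assuming `texts` is a
# character vector of raw documents (one element per document).
library(tm)        # stopwords()
library(SnowballC) # wordStem()

clean_string <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)                 # strip punctuation
  x <- gsub("[[:digit:]]+", " ", x)                 # strip digits
  tokens <- strsplit(trimws(gsub("\\s+", " ", x)), " ")[[1]]
  tokens <- tokens[!tokens %in% stopwords("en")]    # drop stop words
  tokens <- wordStem(tokens, language = "english")  # stems EVERY occurrence anew
  paste(tokens, collapse = " ")
}

cleaned <- vapply(texts, clean_string, character(1), USE.NAMES = FALSE)
```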

2. Construct features

In this step we construct a term-document matrix (TDM), followed by filtering of terms based on frequency and TF-IDF values. It is preferable to limit your bag of features to about 1,000 or so. Next, go through the terms and identify which need to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem-completion, I find that it rarely works well.

Problem: (1) Unfortunately, MALLET does not work with TDM constructs, so to make use of your TDM you would need to find the difference between the original TDM (with no features removed) and the TDM that you are happy with; this difference would become the stop words for MALLET, as sketched below. (2) On that note, I'd also like to point out that feature selection requires a substantial amount of manual work; if anyone has ideas on how to minimise it, please share your thoughts.
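A rough sketch of point (1), assuming a tm corpus named corpus and an illustrative frequency threshold; the dropped vocabulary is written out as a file that can be passed to MALLET's importer (e.g. via its --extra-stopwords option).

```r
# Sketch: derive a MALLET stoplist from the difference between the full
# and the filtered TDM; `corpus` and the bounds are illustrative.
library(tm)

tdm_full     <- TermDocumentMatrix(corpus)
tdm_filtered <- TermDocumentMatrix(
  corpus,
  control = list(bounds = list(global = c(5, Inf)))  # keep terms in >= 5 docs
)

dropped <- setdiff(Terms(tdm_full), Terms(tdm_filtered))
writeLines(dropped, "extra_stopwords.txt")  # one term per line, as MALLET expects
```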

Side note: If you decide to stick with R alone, then I can recommend the quanteda package, which has a function dfm that accepts a thesaurus as one of its parameters. This thesaurus allows you to capture patterns (usually regular expressions) as opposed to the words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
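A hedged sketch of that idea follows; note that in recent quanteda versions the thesaurus behaviour has moved from dfm() itself into tokens_lookup() with exclusive = FALSE, and the dictionary below is purely illustrative.

```r
# Sketch with quanteda; in newer versions the thesaurus is applied via
# tokens_lookup(..., exclusive = FALSE) rather than a dfm() argument.
library(quanteda)

dict <- dictionary(list(
  signup = c("sign-up", "signup", "sign up", "signed up")  # illustrative variants
))

toks <- tokens(c(d1 = "I signed up using the sign-up sheet"))
toks <- tokens_lookup(toks, dict, exclusive = FALSE)  # merge variants into one key
m    <- dfm(toks)
```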

3. Find optimal parameters

This is a hard one. I tend to break the data into train-test sets and run cross-validation, fitting a model of k topics and testing the fit using the held-out data. The log-likelihood is recorded and compared across different topic resolutions.

Problem: Log-likelihood does help in understanding how good the fit is, but (1) it often tends to suggest that I need more topics than is practically sensible, and (2) given how long it generally takes to fit a model, it is virtually impossible to search a grid of the other parameters, such as iterations, alpha, burn-in and so on.

Side note: When selecting the optimal number of topics, I generally select a range of topics incrementing by 5 or so, as incrementing the range by 1 generally takes too long to compute; see the sketch below.
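For illustration, a sketch of that held-out comparison using topicmodels (the seed, split, grid and control values are arbitrary; MALLET reports its own log-likelihood during training):

```r
# Sketch of held-out evaluation over a coarse grid of k, using
# topicmodels for illustration; `dtm` is a DocumentTermMatrix.
library(topicmodels)

set.seed(42)
train     <- sample(seq_len(nrow(dtm)), floor(0.8 * nrow(dtm)))
dtm_train <- dtm[train, ]
dtm_test  <- dtm[-train, ]

ks   <- seq(5, 50, by = 5)  # step by 5, as stepping by 1 takes too long
perp <- sapply(ks, function(k) {
  fit <- LDA(dtm_train, k = k, method = "Gibbs",
             control = list(burnin = 500, iter = 1000))
  perplexity(fit, newdata = dtm_test)  # lower is better
})

ks[which.min(perp)]
```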

4. Maintenance

It is easy to classify new data into a set of existing topics. However, if you are running the model over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for, as you are dealing with a problem that requires an unsupervised solution, and yet for it to be tracked over time you need to approach it in a supervised way.

Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on the new data, (3) monitor log-likelihood values over time and devise a threshold for when to switch from the old model to the new one, and (4) merge the old and new solutions somehow so that the evolution of topics is revealed to a lay observer. A sketch of step (1) is shown below.
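A minimal sketch of step (1), using topicmodels for illustration; old_fit is a previously trained LDA model and dtm_new a DocumentTermMatrix built over the same vocabulary, both assumed inputs.

```r
# Sketch: score new documents against an existing model and track fit.
library(topicmodels)

post <- posterior(old_fit, newdata = dtm_new)
new_doc_topics <- post$topics            # documents x topics proportions

perplexity(old_fit, newdata = dtm_new)   # rising values over time suggest refitting
```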

Recap of Problems

  • String cleaning for MALLET to consume the data is inefficient.
  • Feature selection requires manual work.
  • Selecting the optimal number of topics based on log-likelihood does not account for what is practically sensible.
  • Computational complexity does not give the opportunity to search an optimal grid of parameters (other than the number of topics).
  • Maintenance of topics over time poses challenging issues, as you have to retain history but also reflect what is currently relevant.

If you've read this far, I'd like to thank you; this is a rather long post. If you are interested in the subject, feel free either to add more questions in the comments that you think are relevant, or to offer your thoughts on how to overcome some of these problems.

Cheers

Thank you for this thorough summary!

As an alternative to topicmodels, try the mallet package in R. It runs MALLET in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
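For anyone new to the package, here is a minimal sketch of the workflow, assuming a data frame docs with id and text columns and a plain-text stopword file as inputs:

```r
# Minimal sketch of the mallet-in-R workflow; `docs` and the stopword
# file are assumed inputs.
library(mallet)

instances <- mallet.import(docs$id, docs$text, "stopwords.txt",
                           token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

topic.model <- MalletLDA(num.topics = 20)
topic.model$loadDocuments(instances)
topic.model$train(500)

doc.topics  <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
mallet.top.words(topic.model, topic.words[1, ], num.top.words = 10)
```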

To clarify, it's a good idea for documents to be at most around 1,000 tokens long (that limit applies to document length, not to the vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.

Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-document clustering algorithm. Combining multiple related short documents can make a big difference.
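A rough sketch of both adjustments in base R; the ~1,000-token chunk size follows the guideline above, and tweets (with author and text columns) is a hypothetical input.

```r
# Sketch: split long documents into ~1000-token chunks, and pool short
# ones (e.g. tweets grouped by a hypothetical author field).
chunk_document <- function(text, chunk_size = 1000) {
  tokens <- strsplit(text, "\\s+")[[1]]
  groups <- ceiling(seq_along(tokens) / chunk_size)
  vapply(split(tokens, groups), paste, character(1), collapse = " ")
}

pooled <- tapply(tweets$text, tweets$author, paste, collapse = " ")
```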

Vocabulary curation is, in practice, the most challenging part of a topic-modelling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help with vocabulary curation, but this step has a profound impact on results (much more than the number of topics), and I am reluctant to encourage people to fully trust any automated system.
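A trivial sketch of that substitution, with an illustrative phrase list and texts as the raw document vector:

```r
# Sketch: collapse selected multi-word terms into single tokens before
# tokenization; the phrase list is illustrative.
phrases <- c("topic model", "machine learning", "new york")
for (p in phrases) {
  texts <- gsub(p, gsub(" ", "_", p), texts, ignore.case = TRUE)
}
```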

Parameters: I do not believe that there is a "right" number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but beyond a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
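In the R mallet package this is switched on before training; the arguments (optimization interval and burn-in) follow the package examples, and the command-line counterpart is the --optimize-interval option of train-topics.

```r
# Sketch: enable hyperparameter optimization, re-optimizing every 20
# iterations after 50 burn-in iterations, then train.
topic.model$setAlphaOptimization(20, 50)
topic.model$train(1000)
```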

Topic drift: This is not a well-understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. the proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.
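A quick sketch of that proxy, assuming vocab holds the training vocabulary and new_texts the incoming documents (both hypothetical names):

```r
# Sketch: proportion of out-of-vocabulary tokens in new data as a
# drift signal; track this value over time and refit when it rises.
new_tokens <- unlist(strsplit(tolower(new_texts), "\\s+"))
oov_rate   <- mean(!new_tokens %in% vocab)
```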
