Passing Python strings to Mallet for topic modelling

I'm building a corpus of texts harvested from HTML with BeautifulSoup, alongside some metadata. It would be really helpful if I could call Mallet from within Python and have it model topics from Python strings rather than from text files in a directory. That way I could put the n keywords that Mallet finds into each file.

When I run the following, I get a message saying that Mallet has been recognised:

from nltk.classify import mallet
from subprocess import call
mallet.config_mallet("malletdir/mallet-2.0.7/bin")

But I haven't had any luck with the next steps, and I'm not even sure whether Mallet accepts anything other than saved files.

I have not been able to turn up any documentation that I can really understand. Has anybody seen digestible documentation for this? (The NLTK book doesn't get into Mallet.) I would also be happy to learn of any other means of topic modelling within Python that I could operationalise without a really deep knowledge of Python.

Sorry, this is my first rodeo.

In case you are still looking for a solution: Gensim (a Python topic modeling/machine learning package) has a wrapper for Mallet which is easy to use and well documented. There are general Gensim tutorials as well as a specific tutorial for the Mallet wrapper. You may also want to read the installation instructions (mostly the part about setting Java memory), and then you'd be ready to go.

I once tried using Mallet in an NLTK project, and I too ran into dead end after dead end. I think the main thing to keep in mind here is that Mallet is Java-based while NLTK is written in Python.

You already knew that, but my point is that I personally struggled with mixing the technologies because I do not have a strong Java background. I've received the same feedback from coworkers about using Mallet with Python: "Be ready to spend a lot of time debugging."
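Because Mallet is a Java command-line tool that reads its input from disk, one way to feed it Python strings is to write them to a single file in Mallet's one-instance-per-line format (name, label, text) and then call the `import-file` and `train-topics` commands through `subprocess`. The following is only a sketch under those assumptions: the helper names, the file names, and the `no_label` placeholder are hypothetical, and `mallet_bin` must point at your own Mallet executable.

```python
import subprocess
from pathlib import Path


def write_mallet_input(docs, path):
    """Write {name: text} pairs as one 'name<TAB>label<TAB>text' line each."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, text in docs.items():
            # Each line is one instance, so collapse newlines and tabs
            # into single spaces. Names must not contain whitespace.
            flat = " ".join(text.split())
            handle.write(f"{name}\tno_label\t{flat}\n")


def mallet_commands(mallet_bin, workdir, num_topics=20):
    """Build the import and training commands as subprocess argument lists."""
    workdir = Path(workdir)
    import_cmd = [
        str(mallet_bin), "import-file",
        "--input", str(workdir / "docs.tsv"),
        "--output", str(workdir / "docs.mallet"),
        "--keep-sequence", "--remove-stopwords",
    ]
    train_cmd = [
        str(mallet_bin), "train-topics",
        "--input", str(workdir / "docs.mallet"),
        "--num-topics", str(num_topics),
        "--output-topic-keys", str(workdir / "topic_keys.txt"),
        "--output-doc-topics", str(workdir / "doc_topics.txt"),
    ]
    return import_cmd, train_cmd


def run_mallet(docs, mallet_bin, workdir, num_topics=20):
    """Write the strings to disk, then import and train; needs Mallet and Java."""
    workdir = Path(workdir)
    write_mallet_input(docs, workdir / "docs.tsv")
    for cmd in mallet_commands(mallet_bin, workdir, num_topics):
        subprocess.run(cmd, check=True)
    return workdir / "topic_keys.txt", workdir / "doc_topics.txt"
```

The two returned files are plain text, so the keywords for each topic and the topic weights for each document can be read back into Python afterwards.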

Since then I've been using the sklearn library for Python. It is aimed at machine learning more generally rather than at NLP specifically, but it can be used for NLP quite nicely. It comes with a very large selection of modelling tools, and most of it seems to rely on NumPy, so it should be pretty fast. I've used it quite a bit and can say that it is very well written and documented.
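As a rough illustration of topic modelling directly on Python strings with sklearn, here is a sketch that uses `CountVectorizer` and `LatentDirichletAllocation`. Note that this is sklearn's own LDA implementation, not Mallet's, so the topics will differ. The function name `top_keywords` and its default parameters are assumptions made for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def top_keywords(texts, n_topics=2, n_keywords=5, seed=0):
    """Return, for each text, the top keywords of its most probable topic."""
    # Turn the raw strings into a document-term count matrix.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

    # Recover the vocabulary in column order from the term -> index mapping.
    vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    topic_words = [
        [vocab[i] for i in weights.argsort()[::-1][:n_keywords]]
        for weights in lda.components_
    ]
    # Assign each document the keywords of its highest-weighted topic.
    return [topic_words[row.argmax()] for row in doc_topics]
```

Each inner list could then be stored next to the metadata harvested for the corresponding text.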

I don't want to discourage you from using Mallet, especially not just on my say-so. But if you are open to alternatives, I think you will find that when building projects with NLTK it's far easier to use Python modules, since NLTK itself is written in Python. I hope this helps!

Note: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite the site URL or the original address.