NLP for Java: which toolkit should I use?

Original post: 2011-12-15 04:54:25 7 4 java/ text/ nlp/ text-mining

I'm working on a project that needs to count the occurrences of every word in a text file. For example, I have a text file like this:

What Silver Lake Looks For in IPO Candidates
3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM
IBM's Palmisano: How You Get To Be A 100-Year Old Company

Suppose the file contains the 3 sentences shown above and I want to count every word's occurrences. Here, "Companies" and "Company" should be treated as the same word "company" (lowercase), so the total count for the word "company" is 2.

Is there any NLP toolkit for Java that can tell that two words like "families" and "family" actually come from the same word "family"?

I'll use the per-word counts for Naive Bayes training, so it's very important to get an accurate number of occurrences for each word.

4 answers

Apache Lucene and OpenNLP provide good stemming algorithm implementations. You can review them and use the one that suits you best. I've been using Lucene for my projects.

You may also look at LingPipe: http://alias-i.com/lingpipe/

You may also look at GATE: http://gate.ac.uk/

If you want to use words to train a bag-of-words model, you can use the TF-IDF value instead of the absolute count.

http://en.wikipedia.org/wiki/Tf%E2%80%93idf
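
To make the TF-IDF suggestion concrete, here is a minimal sketch using only the JDK. It assumes each document has already been tokenized and normalized into a list of words; the class name `TfIdfSketch`, its methods, and the tiny corpus are made up for illustration, and it uses the plain `tf * log(N / df)` form, which is only one of several common weighting variants.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: plain TF-IDF over pre-tokenized documents (hypothetical names).
class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) return 0.0;
        long hits = doc.stream().filter(term::equals).count();
        return (double) hits / doc.size();
    }

    // Inverse document frequency: ln(N / number of documents containing the term).
    static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        if (docsWithTerm == 0) return 0.0; // assumption: unseen terms get weight 0
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        // Made-up corpus: three already-normalized "documents".
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("ibm", "company", "earning"),
                Arrays.asList("apple", "company"),
                Arrays.asList("silver", "lake", "ipo"));
        List<String> first = corpus.get(0);
        // "ibm" appears in one document, "company" in two, so "ibm" scores higher.
        System.out.println("ibm:     " + tfIdf("ibm", first, corpus));
        System.out.println("company: " + tfIdf("company", first, corpus));
    }
}
```

A word that occurs in every document gets an IDF of 0 here, which is the point of the weighting: words that do not help tell documents apart contribute nothing.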

What you are doing is called stemming (getting the root word).

As mentioned, LingPipe, GATE and Lucene/Solr do stemming. Another option is the Stanford parser. Or you could implement the Porter stemming algorithm yourself.
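
To show the overall shape of "normalize, then count", here is a rough JDK-only sketch. Note that it is not the Porter algorithm and not any of the libraries above: `normalize` is a deliberately crude, hypothetical rule set that lowercases a token, drops a possessive "'s", and strips a couple of plural endings. A real stemmer handles far more cases and typically returns stems such as "compani" rather than dictionary words, so in practice you would swap `normalize` for a library call.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: a crude plural-stripping normalizer, NOT the Porter algorithm.
class NaiveStemCounter {

    // Lowercases a token, drops a possessive "'s", and strips simple plural suffixes.
    static String normalize(String token) {
        String w = token.toLowerCase();
        if (w.endsWith("'s")) {
            w = w.substring(0, w.length() - 2);            // ibm's -> ibm
        }
        if (w.length() > 4 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y";   // companies -> company
        }
        if (w.length() > 3 && w.endsWith("s")
                && !w.endsWith("ss") && !w.endsWith("us") && !w.endsWith("is")) {
            return w.substring(0, w.length() - 1);         // candidates -> candidate
        }
        return w;
    }

    // Splits text on anything that is not a letter, digit, or apostrophe,
    // then counts the normalized tokens.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            String word = normalize(token);
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "What Silver Lake Looks For in IPO Candidates\n"
                + "3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM\n"
                + "IBM's Palmisano: How You Get To Be A 100-Year Old Company";
        Map<String, Integer> counts = countWords(text);
        System.out.println(counts.get("company")); // prints 2
    }
}
```

The resulting map can feed the Naive Bayes training directly; only the `normalize` step needs to change if you later move to Lucene, OpenNLP, or your own Porter implementation.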