
Lemmatization with Apache Lucene

I'm developing a text analysis project using Apache Lucene. I need to lemmatize some text (transform the words to their canonical forms). I've already written the code that does the stemming. Using it, I am able to convert the following sentence

The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production

into

stem part word never chang even when morpholog inflect lemma base form word exampl from produc lemma produc stem produc becaus word product

However, I need to get the base forms of the words: example instead of exampl, produce instead of produc, and so on.
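To make the difference concrete, here is a toy sketch in plain Java. It is not what Lucene does internally: the suffix list and the mini-dictionary are made up for illustration, and the class and method names (StemVsLemma, stem, lemma) are hypothetical. It only shows that stemming cuts suffixes mechanically, while lemmatization needs some kind of dictionary lookup.

```java
import java.util.List;
import java.util.Map;

public class StemVsLemma {
    // Toy suffix list, checked in order. This is NOT the Porter algorithm.
    private static final List<String> SUFFIXES = List.of("tion", "ed", "es", "e", "s");

    // Hypothetical mini-dictionary mapping inflected forms to their lemmas.
    private static final Map<String, String> LEMMAS = Map.of(
            "produced", "produce",
            "changes", "change");

    /** Strips the first matching suffix, keeping at least three characters. */
    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Looks the word up in the dictionary; unknown words are returned unchanged. */
    public static String lemma(String word) {
        return LEMMAS.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        for (String w : List.of("produced", "production", "changes", "example")) {
            System.out.println(w + " -> stem: " + stem(w) + ", lemma: " + lemma(w));
        }
    }
}
```

With this toy data, "produced" and "production" both stem to "produc", while the lemma of "produced" is "produce". A real lemmatizer needs a full morphological dictionary per language, which is exactly what I am looking for.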

I am using Lucene because it has analyzers for many languages (I need at least English and Russian). I know about the Stanford NLP library, but it has no Russian language support.

So is there any way to do lemmatization for several languages, the same way I do stemming with Lucene?

The simplified version of my code responsible for stemming:

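import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.tika.language.LanguageIdentifier;

// Use Apache Tika to identify the language of the text
LanguageIdentifier identifier = new LanguageIdentifier(text);
// Get the analyzer for that language (e.g. EnglishAnalyzer for "en");
// getAnalyzer is my own helper, not shown here
Analyzer analyzer = getAnalyzer(identifier.getLanguage());
TokenStream stream = analyzer.tokenStream("field", text);
stream.reset();
while (stream.incrementToken()) {
    // The term text of the current token is the stem
    String stem = stream.getAttribute(CharTermAttribute.class).toString();
    // Do something with the stem
    System.out.print(stem + " ");
}
stream.end();
stream.close();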

UPDATE: I found a library that does almost what I need (for English and Russian) and uses Apache Lucene (although in its own way); it's definitely worth exploring.

In case someone still needs it, I decided to return to this question and illustrate how to use the russianmorphology library I found earlier to do lemmatization for English and Russian.

First of all, you will need these dependencies (besides lucene-core):

<!-- if you need Russian -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>russian</artifactId>
    <version>1.1</version>
</dependency>

<!-- if you need English -->
<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>english</artifactId>
    <version>1.1</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene.morphology</groupId>
    <artifactId>morph</artifactId>
    <version>1.1</version>
</dependency>

Note that these artifacts are located in the CUBA repository ( https://dl.bintray.com/cuba-platform/main/ ).

Then, make sure you import the right analyzer:

import org.apache.lucene.morphology.english.EnglishAnalyzer;
import org.apache.lucene.morphology.russian.RussianAnalyzer;

These analyzers, unlike the standard Lucene analyzers, use MorphologyFilter, which converts each word into the set of its normal forms.

So if you use the following code

String text = "The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from \"produced\", the lemma is \"produce\", but the stem is \"produc-\". This is because there are words such as production";
Analyzer analyzer = new EnglishAnalyzer();
TokenStream stream = analyzer.tokenStream("field", text);
stream.reset();
while (stream.incrementToken()) {
    String lemma = stream.getAttribute(CharTermAttribute.class).toString();
    System.out.print(lemma + " ");
}
stream.end();
stream.close();

it will print

the stem be the part of the word that never change even when morphologically inflected inflect a lemma be the base form of the word for example from produced produce the lemma be produce but the stem be produc this be because there are be word such as production

And for the Russian text

String text = "Продолжаю цикл постов об астрологии и науке. Астрология не имеет научного обоснования, но является частью истории науки, частью культуры и общественного сознания. Поэтому астрологический взгляд на науку весьма интересен.";

the RussianAnalyzer will print the following:

продолжать цикл пост об астрология и наука астрология не иметь научный обоснование но являться часть частью история наука часть частью культура и общественный сознание поэтому астрологический взгляд на наука весьма интересный

You may notice that some words have more than one base form, e.g. inflected is converted to [inflected, inflect]. If you don't like this behaviour, you would have to change the implementation of org.apache.lucene.morphology.analyzer.MorphologyFilter (if you are interested in how exactly to do it, let me know and I'll elaborate on this).
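As a possible alternative to touching the filter, the forms could be collapsed after the fact. The sketch below is plain Java and rests on an assumption I have not verified against the library source: that MorphologyFilter emits the extra forms of a word with a position increment of 0, which is Lucene's usual way of stacking tokens at one position (readable via PositionIncrementAttribute). The class LemmaGrouping and its methods are hypothetical names for illustration; in real code the two input lists would be collected from the token stream.

```java
import java.util.ArrayList;
import java.util.List;

public class LemmaGrouping {
    /**
     * Groups terms by token position. An increment of 0 means "same position
     * as the previous term" (ASSUMPTION: extra normal forms are emitted this way).
     */
    public static List<List<String>> groupByPosition(List<String> terms, List<Integer> increments) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            if (increments.get(i) > 0 || groups.isEmpty()) {
                groups.add(new ArrayList<>()); // a new position starts here
            }
            groups.get(groups.size() - 1).add(terms.get(i));
        }
        return groups;
    }

    /** Keeps one form per position: the last one, which is an arbitrary illustrative choice. */
    public static List<String> pickLast(List<List<String>> groups) {
        List<String> result = new ArrayList<>();
        for (List<String> group : groups) {
            result.add(group.get(group.size() - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("morphologically", "inflected", "inflect", "a");
        List<Integer> increments = List.of(1, 1, 0, 1);
        // prints [morphologically, inflect, a]
        System.out.println(pickLast(groupByPosition(terms, increments)));
    }
}
```

Picking the last form happens to give inflect and produce for the sample output above, but that is only a heuristic; which form to keep depends on your use case.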

Hope it helps, good luck!

Yep, StanfordNLP is good for English. But if you need to support several languages, I can recommend Freeling. Check its online demo (Freeling_online_demo): select the language and the output (morphological analysis for lemmatization). I don't speak Russian, but I think it works for this text:

Продолжаю цикл постов об астрологии и науке. Астрология не имеет научного обоснования, но является частью истории науки, частью культуры и общественного сознания. Поэтому астрологический взгляд на науку весьма интересен.

For machine readability you can use the XML output (shown below your results), and for automation you can integrate Freeling with Python/Java, but I usually prefer to just call it via the command line.
