
Speed up use of WordNet lemmatizer for Java

Another question is similar to this one, but it concerns a different programming language and seems to address a related but not identical problem: Is it possible to speed up Wordnet Lemmatizer?

We are stemming a large number of words in a text, and the code spends more than 90% of its run time on stemming alone, as can be seen in the picture.

[Image: profiling results]

As we read through the code and profiled it, it seemed that WordNet actually reads from a file whenever it stems a word, which takes most of the execution time! Is there a way to improve performance by, say, using a database instead of file reading to supply the data for the stemming process, or by loading everything necessary into memory and ignoring the file? Or by adding some caching to the stemming process?

Are there tools that would be easy to plug in to replace the line reading?

See the line reading profiling here:

[Image: line reading profile]

As you can see, file reading in total takes up to 62% of the run time.

One can use a MapBackedDictionary or a DatabaseBackedDictionary instead of a FileBackedDictionary.

Here is how I succeeded in running with MapBackedDictionary.

This requires the JWNL utilities. If you open the WordNet project, you can use the main method of its DictionaryToMap.java class to convert your existing dictionary folder to a map folder.

After that, you can create a map_properties.xml file similar to the file_properties.xml you used earlier for your FileBackedDictionary. This time the tags will differ a bit. I am posting my example XML here, which worked well for me.

<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
<version publisher="Princeton" number="3.0" language="en"/>
<dictionary class="net.didion.jwnl.dictionary.MapBackedDictionary">
    <param name="morphological_processor" value="net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor">
        <param name="operations">
            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
            <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
                <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
                <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
                <param name="adjective" value="|er=|est=|er=e|est=e|"/>
                <param name="operations">
                    <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                </param>
            </param>
            <param value="net.didion.jwnl.dictionary.morph.TokenizerOperation">
                <param name="delimiters">
                    <param value=" "/>
                    <param value="-"/>
                </param>
                <param name="token_operations">
                    <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
                        <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
                        <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
                        <param name="adjective" value="|er=|est=|er=e|est=e|"/>
                        <param name="operations">
                            <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                        </param>
                    </param>
                </param>
            </param>
        </param>
    </param>
    <param name="dictionary_element_factory" value="net.didion.jwnl.data.MapBackedDictionaryElementFactory"/>
    <param name="file_type" value="net.didion.jwnl.princeton.file.PrincetonObjectDictionaryFile"/>
    <param name="dictionary_path" value="path\to\wordnetMap\"/>
</dictionary>
<resource class="PrincetonResource"/>
</jwnl_properties>

Pay attention to the path to wordnetMap: set it to the folder where you wrote the converted dictionary with the method mentioned earlier.

Don't forget to initialize JWNL with the new properties file. The MapBackedDictionary takes longer to load initially, but the performance boost is extreme.
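On the caching idea raised in the question: whichever dictionary backend is used, repeated words in a text can be served from an in-memory cache so that each distinct word is looked up only once. What follows is a minimal sketch, not part of JWNL. The class name CachingLemmatizer, the Function<String, String> wrapper around the real lookup, and the choice to fall back to the word itself when no lemma is found are all assumptions made for illustration. The "slow" lambda in main is only a stand-in for the actual lemmatizer call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper (not part of JWNL): memoizes lemma lookups per word. */
final class CachingLemmatizer {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> delegate;

    CachingLemmatizer(Function<String, String> delegate) {
        this.delegate = delegate;
    }

    /** Returns the cached lemma; the delegate is called only the first time a word is seen. */
    String lemmatize(String word) {
        return cache.computeIfAbsent(word, w -> {
            String lemma = delegate.apply(w);
            // Assumption: fall back to the surface form when the delegate finds nothing.
            return lemma != null ? lemma : w;
        });
    }

    /** Number of distinct words cached so far. */
    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        AtomicInteger slowCalls = new AtomicInteger();
        // Stand-in for the real (slow) lookup; replace with the actual lemmatizer call.
        Function<String, String> slow = w -> {
            slowCalls.incrementAndGet();
            return w.endsWith("s") ? w.substring(0, w.length() - 1) : null;
        };
        CachingLemmatizer lemmatizer = new CachingLemmatizer(slow);
        for (String token : new String[] {"dogs", "cats", "dogs", "fish", "dogs"}) {
            System.out.println(token + " -> " + lemmatizer.lemmatize(token));
        }
        System.out.println("delegate calls: " + slowCalls.get());
    }
}
```

With JWNL 1.x, the delegate would presumably wrap something like Dictionary.getInstance().getMorphologicalProcessor().lookupBaseForm(POS.NOUN, word) after JWNL.initialize(new FileInputStream("map_properties.xml")) has been called; check this against the JWNL version you use. The cache here is unbounded, which is usually acceptable for a natural-language vocabulary, but a size limit may be needed for very noisy input.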
