
Speed up use of WordNet lemmatizer for Java

A similar question exists, but it concerns a different programming language and addresses a related but not identical problem. Is it possible to speed up the WordNet lemmatizer?

We are stemming a very large number of words in a text, and the program spends more than 90% of its time on stemming alone, as can be seen in the picture.

[Image: profiling process]

After reading through the code and profiling it, we found that WordNet appears to read from a file every time it stems a word, and this takes up most of the execution time. Is there a way to improve performance by, say, using a database instead of file reads to supply the data for stemming, or by loading everything necessary into memory and bypassing the files? Or by adding some caching to the stemming process?

Are there some tools that would be easy to plug in to replace the line reading?

See the line reading profiling here:

[Image: line-reading profile]

As you can see, file reading accounts in total for up to 62% of the run time.

One can use MapBackedDictionary or a DatabaseBackedDictionary instead of a FileBackedDictionary.

Here is how I succeeded in running with MapBackedDictionary.

This requires the JWNL utilities. If you open the WordNet project, you can use the main method of their DictionaryToMap.java class to convert your existing dictionary folder to a map folder.

After that, create a map_properties.xml file similar to the file_properties.xml you used earlier for your FileBackedDictionary. The tags differ slightly this time. Here is my example XML, which worked well for me.

<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
<version publisher="Princeton" number="3.0" language="en"/>
<dictionary class="net.didion.jwnl.dictionary.MapBackedDictionary">
    <param name="morphological_processor" value="net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor">
        <param name="operations">
            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
            <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
                <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
                <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
                <param name="adjective" value="|er=|est=|er=e|est=e|"/>
                <param name="operations">
                    <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                </param>
            </param>
            <param value="net.didion.jwnl.dictionary.morph.TokenizerOperation">
                <param name="delimiters">
                    <param value=" "/>
                    <param value="-"/>
                </param>
                <param name="token_operations">
                    <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
                        <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
                        <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
                        <param name="adjective" value="|er=|est=|er=e|est=e|"/>
                        <param name="operations">
                            <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                        </param>
                    </param>
                </param>
            </param>
        </param>
    </param>
    <param name="dictionary_element_factory" value="net.didion.jwnl.data.MapBackedDictionaryElementFactory"/>
    <param name="file_type" value="net.didion.jwnl.princeton.file.PrincetonObjectDictionaryFile"/>
    <param name="dictionary_path" value="path\to\wordnetMap\"/>
</dictionary>
<resource class="PrincetonResource"/>
</jwnl_properties>

Pay attention to the path to wordnetMap: set it to the folder where you wrote the output of the dictionary conversion mentioned earlier.

Don't forget to initialize JWNL with the new properties file. The MapBackedDictionary will take longer to load initially, but the performance boost is extreme.
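
The caching idea raised in the question can be combined with either dictionary type. Below is a minimal sketch, not part of JWNL: the class name CachingLemmatizer and its constructor parameters are hypothetical, and the actual lookup is passed in as a plain Function so the example does not depend on any particular JWNL call. It keeps a bounded least-recently-used cache in front of the slow lookup, so repeated words in a text are resolved only once.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical helper: wraps any word -> lemma function with a bounded LRU cache.
 * The delegate stands in for the real (slow) dictionary lookup.
 */
final class CachingLemmatizer implements Function<String, String> {
    private final Function<String, String> delegate;
    private final Map<String, String> cache;

    CachingLemmatizer(Function<String, String> delegate, int maxEntries) {
        this.delegate = delegate;
        // accessOrder = true keeps entries ordered from least to most recently used,
        // so removeEldestEntry evicts the least recently used word once the cache is full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached lemma if present; otherwise asks the delegate and stores the result. */
    @Override
    public synchronized String apply(String word) {
        String lemma = cache.get(word);
        if (lemma == null) {
            // Cache miss (or the delegate previously returned null): do the slow lookup.
            lemma = delegate.apply(word);
            cache.put(word, lemma);
        }
        return lemma;
    }
}
```

To use it, you would pass your existing stemming call as the delegate, for example new CachingLemmatizer(word -> myStem(word), 100_000), where myStem is a placeholder for whatever method currently performs the WordNet lookup. The method is synchronized for simplicity; if many threads stem concurrently, a different cache structure may be a better fit.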
