
Spell checker using Lucene

I am trying to write a spell corrector using the Lucene SpellChecker. I want to give it a single text file containing blog text. The problem is that it works only when my dictionary file has one sentence/word per line. Also, the suggest API returns results without giving any weight to the number of occurrences. The source code follows:

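    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SpellCorrector {

        private SpellChecker spellChecker = null;

        public SpellCorrector() {
            try {
                // Directory that holds the spell-check index on disk.
                File file = new File("/home/ubuntu/spellCheckIndex");
                Directory directory = FSDirectory.open(file);

                spellChecker = new SpellChecker(directory);

                StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
                // Should I format this file with one sentence/word per line?
                spellChecker.indexDictionary(
                        new PlainTextDictionary(new File("/home/ubuntu/main.dictionary")), config, true);
            } catch (IOException e) {
                // Report the failure instead of swallowing it silently.
                System.err.println("Could not build the spell-check index: " + e.getMessage());
            }
        }

        /** Returns the top suggestion for the query, or null if there is none. */
        public String correct(String query) {
            if (spellChecker != null) {
                try {
                    // This returns the suggestion not based on occurrence but based on when it occurred
                    String[] suggestions = spellChecker.suggestSimilar(query, 5);

                    if (suggestions != null && suggestions.length != 0) {
                        return suggestions[0];
                    }
                } catch (IOException e) {
                    return null;
                }
            }
            return null;
        }
    }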

Do I need to make some changes?

Regarding your first issue, that sounds like the expected, documented dictionary format: the PlainTextDictionary API documentation specifies one word per line. If you want to pass arbitrary text in, you might want to index it and use a LuceneDictionary instead, or possibly a HighFrequencyDictionary, depending on your needs.
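If you would rather keep PlainTextDictionary, one option is to preprocess the blog text into a one-word-per-line file first. The following is only a sketch using the JDK, not anything Lucene provides: the class name DictionaryFileBuilder, the word pattern (runs of letters with optional inner apostrophes), and the lowercasing are all assumptions you may want to change.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns free text into a one-word-per-line dictionary file.
class DictionaryFileBuilder {

    // Assumption: a "word" is a run of letters, optionally with inner apostrophes.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    // Returns the distinct lowercased words of the text, in order of first appearance.
    static List<String> extractWords(String text) {
        Set<String> unique = new LinkedHashSet<String>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            unique.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return new ArrayList<String>(unique);
    }

    // Reads the blog text as UTF-8 and writes one word per line to the dictionary file.
    static void writeDictionary(Path blogText, Path dictionary) throws IOException {
        String text = new String(Files.readAllBytes(blogText), StandardCharsets.UTF_8);
        Files.write(dictionary, extractWords(text), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DictionaryFileBuilder <blog-text> <dictionary-out>");
            return;
        }
        writeDictionary(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Note that a deduplicated word list like this carries no occurrence counts, so it does not help with the popularity question; for that, indexing the text and using a LuceneDictionary is the better fit.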

The SpellChecker ranks replacements by the similarity between the words (Levenshtein distance) before any other concern. If you want it to recommend only more popular terms, pass a SuggestMode to SpellChecker.suggestSimilar, using the overload that also takes an IndexReader and a field name. With SuggestMode.SUGGEST_MORE_POPULAR, the suggestions are at least as strong, popularity-wise, as the word they are intended to replace.
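To see why similarity alone ignores how often a word occurs, here is a minimal plain-Java sketch of Levenshtein edit distance. It is an illustration only, not Lucene's code: the class name EditDistance is made up, and Lucene's own StringDistance implementations return a normalized similarity between 0 and 1 rather than a raw edit count.

```java
// Illustrative stand-in, not Lucene's implementation.
class EditDistance {

    // Classic dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions, and substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters from a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("speling", "spelling"));
    }
}
```

Nothing in this measure looks at frequency: two candidates at the same distance from the query are equally good as far as the distance is concerned, which is why a SuggestMode or a custom comparator is needed to bring popularity in.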

If you must override how Lucene decides on the best matches, you can do that with SpellChecker.setComparator, supplying your own Comparator over SuggestWords. Since SuggestWord exposes freq, it should be easy to order the matches found by popularity.
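As a rough sketch of the ordering logic such a comparator could use, the example below sorts candidates by frequency first, then by similarity score. To keep it runnable without Lucene on the classpath, Candidate is a hypothetical stand-in that mirrors SuggestWord's string, score, and freq fields; the class names and sample values are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class FrequencyFirstRanking {

    // Hypothetical stand-in for Lucene's SuggestWord (same three pieces of data).
    static final class Candidate {
        final String string; // the suggested word
        final float score;   // similarity to the query, higher is closer
        final int freq;      // how often the word occurs

        Candidate(String string, float score, int freq) {
            this.string = string;
            this.score = score;
            this.freq = freq;
        }
    }

    // Best candidate first: higher frequency, then higher score, then alphabetical.
    static final Comparator<Candidate> BEST_FIRST = new Comparator<Candidate>() {
        @Override
        public int compare(Candidate a, Candidate b) {
            if (a.freq != b.freq) {
                return Integer.compare(b.freq, a.freq);
            }
            if (a.score != b.score) {
                return Float.compare(b.score, a.score);
            }
            return a.string.compareTo(b.string);
        }
    };

    // Returns a sorted copy, leaving the input list untouched.
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<Candidate>(candidates);
        Collections.sort(sorted, BEST_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = new ArrayList<Candidate>();
        candidates.add(new Candidate("their", 0.8f, 120));
        candidates.add(new Candidate("there", 0.8f, 300));
        candidates.add(new Candidate("three", 0.6f, 500));
        for (Candidate c : rank(candidates)) {
            System.out.println(c.string + " freq=" + c.freq + " score=" + c.score);
        }
    }
}
```

Before adapting this to a real Comparator<SuggestWord>, check which orientation Lucene's suggestion queue expects (whether "greater" means better or worse), since this sketch simply sorts a list best-first and does not assume anything about the queue.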
