Fuzzy Queries in Lucene

I am using Lucene in Java to index a table in our database by company name. After indexing, I want to run a fuzzy match (Levenshtein distance) on each value before we insert it into the database. The reason is that we do not want to enter duplicates caused by spelling errors.

For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".

From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then settle on an adequate threshold for deciding what is valid or invalid.

The problem is I am stuck, and after searching what seems like everywhere on the internet, need the StackOverflow community's help.

Like I said I have indexed the database on company name, and then have the following code:

IndexSearcher searcher = new IndexSearcher(directory);  

new QueryParser(Version.LUCENE_30, "company", analyzer);

Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));

The problem comes afterwards: I do not know how to get the fuzzy match value. I know the code must look something like the following, but no collector seems to fit my needs. (As you can see, right now I can only count the number of matches, which is useless to me.)

TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);

searcher.search(fuzzy_query, collector);

System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());

Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:

import org.apache.lucene.queryParser.*;

Does anybody have an idea why it is inaccessible, or what I am doing wrong? Apologies for the length of the question.

You do not need Lucene to get the score. Take a look at the Simmetrics library; it is exceedingly simple to use. Just add the jar and use it like this:

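// Package name as shipped in the Simmetrics jar; adjust if your version differs.
import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;

Levenshtein ld = new Levenshtein();
// Returns a similarity between 0 (no match) and 1 (identical strings).
float sim = ld.getSimilarity(string1, string2);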

Also note that, depending on the type of data (i.e. longer strings, amount of whitespace, etc.), you might want to look at other algorithms such as Jaro-Winkler or Smith-Waterman.

You could use the above to decide when to collapse fuzzy duplicate strings into one "master" string, and then index that.
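If adding a library is not an option, the same idea can be sketched in plain Java. This is only an illustration, not the Simmetrics implementation: it assumes the similarity is normalized as 1 - distance / (length of the longer string). The class name FuzzyDuplicateCheck, the helper findNearDuplicate, the 0.9 threshold and the sample name "Acme Tools" are all made up for the example.

```java
import java.util.Arrays;
import java.util.List;

public class FuzzyDuplicateCheck {

    // Levenshtein edit distance, dynamic programming with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        return 1.0f - (float) levenshtein(a, b) / maxLen;
    }

    // Hypothetical helper: first existing name at or above the threshold, else null.
    static String findNearDuplicate(String candidate, List<String> existing, float threshold) {
        for (String name : existing) {
            if (similarity(candidate.toLowerCase(), name.toLowerCase()) >= threshold) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "Acme Tools" is invented sample data.
        List<String> existing = Arrays.asList("Acme Tools", "Widget Makers XYZ");
        System.out.println(similarity("Widget Makers XYZ", "Widget Maker XYZ"));
        System.out.println(findNearDuplicate("Widget Maker XYZ", existing, 0.9f));
    }
}
```

For the pair from the question the edit distance is 1 and the longer string has 17 characters, so the similarity comes out at about 0.94. A suitable threshold depends on the data, so it is worth trying several values against known duplicates before settling on one.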

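You can get the score of each match from the collector (note that this is Lucene's relevance score for the hit, not a raw Levenshtein similarity):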

TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}
