

Searching names with Apache Solr

I've just ventured into the seemingly simple but extremely complex world of search. For an application, I need to build a mechanism for searching users by their names.

After reading numerous posts and articles, including:

How can I use Lucene for personal name (first name, last name) search?
http://dublincore.org/documents/1998/02/03/name-representation/
what's the best way to search a social network by prioritizing a users relationships first?
http://www.gossamer-threads.com/lists/lucene/java-user/120417
Lucene Index and Query Design Question - Searching People
Lucene Fuzzy Search for customer names and partial address

... and a few others I cannot find at the moment, and after getting at least indexing and basic search working on my machine, I have devised the following scheme for user searching:

1) Have first, second and third name fields and index them with Solr
2) Use edismax as the query parser for searching across multiple fields
3) Use a combination of normalization filters, such as transliteration, Latin-to-ASCII conversion, etc.
4) Finally, use fuzzy search

Being very new to this, I am unsure whether the above is the best way to do it, and I would like to hear from experienced users who know this field better than I do.

I need to be able to match names in the following ways:

1) Accent folding: Jorn matches Jörn and vice versa
2) Alternative spellings: Karl matches Carl and vice versa
3) Shortened representations (I believe I do this with the SynonymFilterFactory): Sue matches Susanne, etc.
4) Levenshtein matching: Jonn matches John, etc.
5) Soundex matching: Elin and Ellen

Any guidance, criticisms or comments are very welcome. Please let me know if this is possible ... or perhaps I'm just day-dreaming. :)


EDIT

I must also add that I have a fullname field in case some people have long names. As an example from one of the posts: Jon Paul or Del Carmen should also match Jon Paul Del Carmen.

And since this is a new project, I can modify the schema and architecture any way I see fit, so there are very few restrictions.

It sounds like you are catering for a corpus with searches that you need to match very loosely?

If so, you will want to choose your fields and set different boosts on them to rank your results.

So have separate "copied" fields in Solr:

  • one field for the exact full name (with filters)
  • a multivalued field with the ASCIIFolding, Lowercase... filters
  • a multivalued field with the SynonymFilterFactory, ASCIIFolding, Lowercase... filters
  • a field with the PhoneticFilterFactory (with Caverphone or Double-Metaphone)
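
To make the idea of a phonetic "key" concrete, here is a minimal plain-Java sketch of classic American Soundex, the algorithm the question mentions. It is only an illustration under my own assumptions: the class and method names are made up, it is not the code Solr runs, and the Caverphone or Double-Metaphone encoders suggested above produce different (usually better) keys.

```java
import java.util.Locale;

final class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that carry no code (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    /** Returns the 4-character Soundex key, e.g. one letter followed by three digits. */
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // this sketch ignores anything outside A..Z
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
            } else if (code != '0' && code != prev) {
                out.append(code); // new consonant group
            }
            // H and W do not break a run of identical codes; vowels do.
            if (raw != 'H' && raw != 'W') {
                prev = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short keys with zeros
        }
        return out.toString();
    }
}
```

With this sketch, `SoundexSketch.soundex("Elin")` and `SoundexSketch.soundex("Ellen")` both return `E450`, which is why a phonetic field lets the two names match each other.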

See also: more non-English Soundex discussion

Synonyms for names: I don't know if there is a public synonym database available.

Fuzzy searching: I've not found it useful. It uses Levenshtein distance.
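
For reference, this is roughly what "Levenshtein distance" means, as a small plain-Java sketch. The class name is my own and this is the textbook dynamic-programming version, not the automaton-based matching that Lucene's fuzzy queries actually perform.

```java
final class EditDistanceSketch {
    /** Classic Levenshtein distance: minimum number of single-character edits. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The comparison is case-sensitive, so lower-case both names first. `levenshtein("jonn", "john")` and `levenshtein("carl", "karl")` are both 1, while `levenshtein("elin", "ellen")` is 2. Plenty of unrelated names are also only one or two edits apart, which is one reason edit distance on its own tends to rank poorly.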

Other filters and indexing choices produce more relevant search results.

Accented and other non-ASCII Latin characters in names can be folded to their ASCII equivalents with the ASCIIFoldingFilterFactory.
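
As a rough illustration of what accent folding does, here is a plain-Java approximation using `java.text.Normalizer`. It is only a sketch with a made-up class name. The Solr filter works from its own mapping table and also handles letters that do not decompose into base letter plus accent (for example ø or ß), which this approximation leaves untouched.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AccentFoldingSketch {
    /** Lower-cases the input and strips combining accent marks. */
    static String fold(String s) {
        // NFD splits "ö" into "o" followed by a combining diaeresis.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }
}
```

Applied at both index and query time, `fold("Jörn")` and `fold("Jorn")` both give `jorn`, so either spelling finds the other.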

You are describing solutions up front for expected use cases.

If you want quality results, plan on tuning your search relevance.

This tuning will be especially valuable when attempting to match name variants such as MacDonald and McDonald, or Carl and Karl. Each of these pairs is only one edit apart in Levenshtein terms, but so are many unrelated names, so edit distance alone will not rank true variants first.

Found a nickname database, not sure how good it is: http://www.peacockdata2.com/products/pdnickname/

Note that it's not free.

The answer in another post is pretty good: Training solr to recognize nicknames or name variants

<fieldType name="name_en" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="english_names.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

For phonetic name search you might also try the Beider-Morse filter, which works pretty well if you have a mixture of names from different countries.

If you want to use it with a typeahead feature, combine it with an EdgeNGramFilter:

<fieldType name="phoneticNames" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>
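
To show what the edge n-gram step contributes to the index-time chain above, here is a small plain-Java sketch with made-up names. It assumes that tokens shorter than the minimum gram size yield nothing, and it uses a raw name token for readability. In the field type above the input would be the phonetic codes emitted by the Beider-Morse filter, not the name itself.

```java
import java.util.ArrayList;
import java.util.List;

final class EdgeNGramSketch {
    /** Returns every prefix of token whose length is between minGram and maxGram. */
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }
}
```

`edgeNGrams("carmen", 3, 15)` yields `car`, `carm`, `carme` and `carmen`. Because only the index analyzer produces the prefixes, a partially typed query can match as soon as its own token equals one of them.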

We created a simple 'name' field type that allows mixing both the 'key' (e.g., SOUNDEX) and 'pairwise' portions of the answers above.

Here's the overview:

  1. At index time, fields of the custom type are indexed into a set of (sub)fields whose values are used for high-recall matching of different kinds of variations.

Here's the core of its implementation...

List<IndexableField> createFields(SchemaField field, String name) {
    // Derive the (sub)fields and values to index for this name.
    Collection<FieldSpec> nameFields = deriveFieldsForName(name);
    List<IndexableField> docFields = new ArrayList<>();
    for (FieldSpec fs : nameFields) {
        docFields.add(new Field(fs.getName(), fs.getStringValue(),
                                fs.getLuceneField()));
    }
    // Store the parsed name itself as doc values alongside the derived sub-fields.
    docFields.add(createDocValues(field.getName(), new Name(name)));
    return docFields;
}

The heart of this is deriveFieldsForName(name), in which you can include 'keys' from PhoneticFilters, LowerCaseFolding, etc.
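
The real deriveFieldsForName is proprietary, so the following is only a hypothetical plain-Java sketch of the general shape such a method might take. The class name, the `deriveKeys` method and the `_exact`, `_folded` and `_initials` suffixes are all my own inventions for illustration. A fuller version would presumably add phonetic keys as well.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class NameKeysSketch {
    /** Maps hypothetical sub-field names to the key value indexed in each. */
    static Map<String, String> deriveKeys(String baseField, String name) {
        String lower = name.trim().toLowerCase(Locale.ROOT);
        // Accent-folded form: decompose, then drop the combining marks.
        String folded = Normalizer.normalize(lower, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Initials of each whitespace-separated name part.
        StringBuilder initials = new StringBuilder();
        for (String part : folded.split("\\s+")) {
            if (!part.isEmpty()) {
                initials.append(part.charAt(0));
            }
        }
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put(baseField + "_exact", lower);
        keys.put(baseField + "_folded", folded);
        keys.put(baseField + "_initials", initials.toString());
        return keys;
    }
}
```

For the name "Jörn Del Carmen" and a base field called `name`, this produces `name_folded` = `jorn del carmen` and `name_initials` = `jdc`. Each entry would become one indexed sub-field, and the query side has to target the same sub-field names.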

  2. At query time, a custom Lucene query is first produced that has been tuned for recall and that uses the same fields as at index time.

Here's the core of its implementation...

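public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
    // Parse the raw query string into a Name, using the request parameters.
    Name name = parseNameString(externalVal, parser.getParams());
    // Build a recall-oriented query over the sub-fields written at index time.
    QuerySpec querySpec = buildQuery(name);
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}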

The heart of this is the buildQuery(name) method, which should produce a query that is aware of deriveFieldsForName(name) above, so that for a given query name it will find good candidate names.

  3. Then, Solr's Rerank feature is used to apply a high-precision re-scoring algorithm to reorder the results.

Here's what this looks like in your query...

&rq={!myRerank reRankQuery=$rrq} &rrq={!func}myMatch(fieldName, "John Doe")

The content of myMatch could be a pairwise Levenshtein or Jaro-Winkler implementation.
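
As an illustration of the pairwise scoring mentioned above, here is a plain-Java sketch of the Jaro-Winkler similarity using the usual defaults (prefix scale 0.1, at most 4 prefix characters). The class and method names are my own, and this is a generic textbook version, not the proprietary myMatch.

```java
final class JaroWinklerSketch {
    /** Jaro similarity in [0, 1]; 1 means identical strings. */
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Characters count as matching only if they are this close in position.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(b.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / a.length() + m / b.length() + (m - transpositions) / m) / 3.0;
    }

    /** Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters. */
    static double jaroWinkler(String a, String b) {
        double jaro = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

The prefix boost suits names, where the first few letters are usually typed correctly. Because a pairwise score like this is relatively expensive, it fits the rerank stage, where it runs only over the top candidates returned by the recall-oriented query.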

NB: Our own full implementation uses proprietary code for deriveFieldsForName, buildQuery, and myMatch (see http://www.basistech.com/text-analytics/rosette/name-indexer/ ) to handle more kinds of variations than the ones mentioned above (e.g., missing spaces, cross-language).


 