
Search/Index Issue with non-English language

I am trying to index a PDF file in Solr, but it looks like some characters are being changed while the text is converted to UTF-8.

For example, the highlighted text below:

[image: demo]

is converted to:

[image: demo]

Search then matches the converted keyword, not the original word. As far as I know, this happens when the PDF text is converted to UTF-8 before indexing.

For reference, below is the code used for indexing:
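One way to check what actually changed is to compare the two strings code point by code point. Below is a minimal sketch, assuming you can get hold of both strings (for example, the text copied from the PDF and the value stored in the indexed field). The class name CodePointDump and the sample string are hypothetical and not part of Solr or Tika; the NFC comparison only tells you whether the difference is a Unicode normalization difference, which is one possible cause among several.

```java
import java.text.Normalizer;

// Hypothetical diagnostic helper; not part of Solr or Tika.
class CodePointDump {

    // Renders each code point of s as U+XXXX, separated by single spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // True if the two strings are equal once both are normalized to NFC,
    // i.e. they differ at most in Unicode normalization form.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Sample only: GUJARATI LETTER GA followed by GUJARATI VOWEL SIGN U.
        String original = "\u0A97\u0AC1";
        // Pass the indexed text as the first argument to compare it.
        String indexed = args.length > 0 ? args[0] : original;

        System.out.println(describe(original));
        System.out.println(describe(indexed));
        System.out.println(sameAfterNfc(original, indexed));
    }
}
```

If the two dumps differ but sameAfterNfc returns true, the text was probably only re-normalized; if they differ in other ways, the extraction step is the more likely place to look.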

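    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    String solrUrlString = "http://localhost:8983/solr/example";
    SolrClient solr = new HttpSolrClient(solrUrlString);

    // Send the PDF to the extracting request handler (Solr Cell / Tika).
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(filepage.getabsPath()), "application/pdf");

    // Set the document id, prefix unknown fields with attr_, and map the extracted body to attr_content.
    up.setParam("literal.id", filepage.getId());
    up.setParam("uprefix", "attr_");
    up.setParam("fmap.content", "attr_content");

    // Commit immediately, waiting for flush and for a new searcher.
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(up);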

I suppose the text you are trying to index is in Gujarati, one of the Indian languages. Solr does provide language analysis for a variety of languages, but as far as I can see, for Indian languages it covers only Hindi. For Hindi, it provides the following filter factories: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory, solr.HindiStemFilterFactory. I cannot find one for Gujarati in the documentation.

You can have a look at the Language Analysis section of the Solr documentation here: https://cwiki.apache.org/confluence/display/solr/Language+Analysis . So with Gujarati as the language in question, I suppose the analysis would be ambiguous and a poor fit for the text. Let me know if you find anything better. Hope this helps :)
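Before choosing an analyzer, it can help to confirm which script the extracted text is really in. The following is a rough sketch using the JDK's Character.UnicodeScript; the class name ScriptCount and the sample string are hypothetical, and it only counts letters per script, it does not identify the language itself.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper; uses only the JDK, not Solr.
class ScriptCount {

    // Counts the letters in text per Unicode script (combining marks,
    // digits, spaces and punctuation are skipped).
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Sample only: one Gujarati letter with a vowel sign, then two Latin letters.
        String sample = args.length > 0 ? args[0] : "\u0A97\u0AC1 ab";
        System.out.println(countScripts(sample));
    }
}
```

If most letters come back as GUJARATI, the text itself survived extraction in the right script, and the field's analysis chain is the next thing to inspect.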

