
Search/index issue with non-English language

I am trying to index a PDF file in Solr, but it looks like characters are being changed while the text is converted to UTF-8.

For example, the highlighted text below:

[image: original text]

is converted to:

[image: converted text]

The search then matches the converted text, not the original word. As far as I know, this happens when the PDF text is converted to UTF-8 before indexing.
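One way to test the UTF-8 hypothesis is to reproduce a charset mismatch on purpose and compare it with what ends up in the index. The sketch below is only an illustration under assumptions: the class name `EncodingCheck`, its helper methods, and the sample word are hypothetical and not part of the indexing code, and ISO-8859-1 is just one example of a wrong charset. It encodes a Gujarati word as UTF-8, decodes the bytes with the wrong charset, and prints the code points so the two strings can be compared.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Encode as UTF-8, then decode the bytes with the wrong charset (ISO-8859-1).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake above: recover the raw bytes and decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Render a string as a space-separated list of code points, e.g. "U+0A97 U+0AC1".
    static String codePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample: the word "Gujarati" written in Gujarati script.
        String original = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        String garbled = misdecode(original);

        System.out.println("original: " + codePoints(original));
        System.out.println("garbled:  " + codePoints(garbled));
        System.out.println("repaired equals original: " + repair(garbled).equals(original));
    }
}
```

If the text stored in Solr still consists of valid Gujarati code points (just different ones), rather than the Latin-range characters a mismatch like this produces, then a charset conversion problem is unlikely to be the cause.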

For reference, below is the code for indexing:

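    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    String solrUrlString = "http://localhost:8983/solr/example";
    SolrClient solr = new HttpSolrClient(solrUrlString);

    // Send the PDF to the extracting request handler.
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(filepage.getabsPath()), "application/pdf");

    // Set the document id, prefix unknown fields with attr_,
    // and map the extracted body text to attr_content.
    up.setParam("literal.id", filepage.getId());
    up.setParam("uprefix", "attr_");
    up.setParam("fmap.content", "attr_content");

    // Commit immediately, waiting for the flush and for a new searcher.
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solr.request(up);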

I suppose the text you are trying to index is in Gujarati, one of the languages of India. Solr does provide language analysis for a variety of languages, but as far as I can see, for Indian languages it covers only Hindi. For Hindi it provides the following filter factories: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory, and solr.HindiStemFilterFactory. I cannot find one for Gujarati in the documentation. You can have a look at the Language Analysis section of the Solr documentation here: https://cwiki.apache.org/confluence/display/solr/Language+Analysis . So with Gujarati as the language in question, I suppose the analysis would be ambiguous and may not behave as expected. Let me know if you find anything better. Hope this helps :)
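Since the answer rests on a guess about the language, it may help to confirm the script of the extracted text before choosing analysis filters. The following is a minimal sketch using only the standard `Character.UnicodeScript` API; the class name `ScriptCheck`, the method `dominantScript`, and the sample strings are hypothetical and introduced only for illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

public class ScriptCheck {
    // Return the Unicode script that most letters in the text belong to,
    // or UNKNOWN if the text contains no letters.
    static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter) // skip digits, punctuation, combining marks
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(UnicodeScript.UNKNOWN);
    }

    public static void main(String[] args) {
        // Hypothetical samples: "Gujarati" in Gujarati script, "Hindi" in Devanagari.
        String gujarati = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0";
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";

        System.out.println(dominantScript(gujarati));
        System.out.println(dominantScript(hindi));
    }
}
```

Running this on the text before and after extraction would show whether both are in the same script, which narrows down whether the change is a script or encoding problem or a substitution within the same script.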
