
Solr PDF and word wrap

I am using Solr to index PDF documents. Everything works well, except for one problem. If a word in a PDF document is wrapped onto the next line, it is indexed as the first part of the word plus a hyphen. For example, text like this:

We ran to the beach. We heard more guns, then every-

thing was quiet and a flag went up above the trees.

Here the word "everything" is split into two parts, "every-" and "thing". If I now search for "everything", this document is not found. What is the right way to handle this case?

Following the advice of Abhijit Bashetti and MatsLindh, I solved the problem. In my schema.xml I added this line:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\-\n" replacement=""/>

After that, words wrapped across lines no longer interfere with the search.
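For context, a charFilter has to sit inside the analyzer of the field type used for the PDF text, so that it runs on the raw character stream before tokenizing. The fragment below is only a sketch of where that line might go: the field type name text_pdf, the tokenizer, and the lowercase filter are assumptions for illustration, not taken from my actual schema.

```xml
<!-- Hypothetical field type; the name "text_pdf" is an assumption. -->
<fieldType name="text_pdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs before the tokenizer: removes a hyphen immediately followed
         by a newline, so "every-\nthing" reaches the tokenizer as "everything". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\-\n" replacement=""/>
    <!-- Tokenizer and filter below are assumed, typical choices. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since this changes index-time analysis, documents indexed earlier need to be reindexed before the fix shows up in search. One trade-off to be aware of: a genuinely hyphenated word that happens to break at the end of a line (for example "well-" followed by "known") will also be joined into a single token.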
