Solr PDF and word wrap

Question

I am using Solr to index PDF documents. Everything works well, but there is one problem. If a word in a PDF document is wrapped onto the next line, it is indexed as the first part of the word plus a hyphen. For example, text like this:

We ran to the beach. We heard more guns, then every-

thing was quiet and a flag went up above the trees.

Here the word everything is broken into the parts every- and thing. Now if I search for everything, I will not be able to find this document. What is the right way to handle this case?

Answer 1

With the advice of Abhijit Bashetti and MatsLindh, the problem was solved. In my schema.xml I added the line:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\-\n" replacement=""/>

After that, the word wrap no longer interferes with the search.
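For context, a charFilter is declared inside the analyzer of a field type and runs before the tokenizer, which is why the hyphen and newline can be removed before the word is split. The fragment below is only a sketch of where the line might sit: the field type name text_pdf, the tokenizer, and the lowercase filter are assumptions for illustration and are not taken from the answer.

```xml
<!-- Hypothetical field type; the name "text_pdf" is an assumption. -->
<fieldType name="text_pdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs on the raw character stream before tokenizing:
         deletes a hyphen immediately followed by a newline,
         so "every-" + newline + "thing" becomes "everything". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\-\n" replacement=""/>
    <!-- Assumed tokenizer and filter; use whatever the field already has. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the filter changes what is written to the index, documents indexed before the change would need to be reindexed for the fix to apply to them. If the extracted text uses Windows line endings or has whitespace after the hyphen, the pattern may need to be loosened (for example to match an optional \r), but that depends on the actual PDF extraction output.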

Solution 1
1 2022-04-20 12:53:50
