
How to exclude text indexed from PDF in Solr query

Original post: 2017-05-29 02:55:20 | Tags: pdf, indexing, solr

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields describing the PDF files themselves. I would like to give my users an option to exclude from the query any text indexed from within a PDF, so that the query results are based on the metadata fields and not biased by the vast amount of text inside the PDF files.

I have thought of maybe having two indexes (cores): one with the indexed PDF text and one without.

Is there another way?

3 Answers

Sounds like you are doing a general search against a default field. That means you have a lot of copyField instructions (or just one copyField * -> text), and they include the PDF content field.

You can create a second destination field and copyField everything except the PDF content field into it as well. This way, users can search against one combined field or the other.
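As a rough sketch of that idea, the snippet below builds Solr Schema API `add-copy-field` commands for two catch-all destinations. The field names (`title`, `author`, `keywords`, `content`) and the destination names (`text_all`, `text_meta`) are assumptions made for illustration, not names from the question, and both destination fields would have to exist in the schema already.

```python
import json

# Hypothetical field names; substitute the ones from your own schema.
METADATA_FIELDS = ["title", "author", "keywords"]
PDF_CONTENT_FIELD = "content"


def copy_field_commands(sources, dest):
    """Build one Schema API 'add-copy-field' command per source field."""
    return [{"add-copy-field": {"source": s, "dest": [dest]}} for s in sources]


# Catch-all that includes the PDF text, and a second one limited to metadata.
# "text_all" and "text_meta" are assumed destination field names.
full_text = copy_field_commands(METADATA_FIELDS + [PDF_CONTENT_FIELD], "text_all")
meta_only = copy_field_commands(METADATA_FIELDS, "text_meta")

if __name__ == "__main__":
    # Each printed line is a JSON body for a separate POST to the core's /schema endpoint.
    for command in full_text + meta_only:
        print(json.dumps(command))
```

The same rules could equally be written by hand as copyField entries in the schema file; the point is only that the metadata-only destination never receives the PDF content field.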

However, remember that this parses all content according to the analysis chain of the destination field. So eDisMax with a list of source fields may be a better approach here. And remember that you can define several request handlers (like 'select'), each with different default parameters. That usually makes the client code a bit easier.
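One way to picture the multiple-handler idea is the sketch below, which builds Config API `add-requesthandler` commands for two eDisMax handlers with different default `qf` lists. The handler names (`/search_meta`, `/search_full`) and field names are hypothetical, and the same defaults could just as well be declared directly in solrconfig.xml.

```python
import json


def search_handler(name, query_fields):
    """Build a Config API 'add-requesthandler' command for an eDisMax handler."""
    return {
        "add-requesthandler": {
            "name": name,
            "class": "solr.SearchHandler",
            # Defaults apply unless the client overrides them in the request.
            "defaults": {"defType": "edismax", "qf": " ".join(query_fields)},
        }
    }


# Hypothetical handler and field names, for illustration only.
meta_handler = search_handler("/search_meta", ["title", "author", "keywords"])
full_handler = search_handler("/search_full", ["title", "author", "keywords", "content"])

if __name__ == "__main__":
    # Each printed line is a JSON body for a POST to the core's /config endpoint.
    print(json.dumps(meta_handler))
    print(json.dumps(full_handler))
```

With handlers like these, the client only chooses which path to call and does not need to know the field list.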

You do not need to use 2 separate indexes. You can use the edismax parser and specify the qf parameter at query time. That determines which fields are searched.
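A minimal sketch of choosing `qf` at query time, assuming hypothetical field names (`title`, `author`, `keywords` for metadata and `content` for the extracted PDF text): the function returns the query string to append to the core's select URL.

```python
from urllib.parse import urlencode

# Hypothetical field names; substitute the ones from your own schema.
METADATA_FIELDS = ["title", "author", "keywords"]
PDF_CONTENT_FIELD = "content"


def build_query(user_query, include_pdf_text):
    """Return an eDisMax query string that searches metadata, optionally plus PDF text."""
    fields = list(METADATA_FIELDS)
    if include_pdf_text:
        fields.append(PDF_CONTENT_FIELD)
    return urlencode({"q": user_query, "defType": "edismax", "qf": " ".join(fields)})


if __name__ == "__main__":
    print(build_query("annual report", include_pdf_text=False))
    print(build_query("annual report", include_pdf_text=True))
```

A checkbox in the search UI could map directly onto the `include_pdf_text` flag, so a single index serves both modes.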

You can look at field aliases.

If you have index fields such as

pdfmeta
pdftext

then you can create two field aliases:

quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext

One advantage of using a field alias over qf is that if your users have bookmarks like q=quicksearch:value, you can change what the quicksearch alias points to without affecting the user's bookmark.
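The aliases above can be sketched with eDisMax's per-alias `f.<alias>.qf` parameters. This is only an illustration: it passes the alias definitions with each request, whereas in practice they would more likely sit in the request handler's defaults so that they can change without touching client URLs.

```python
from urllib.parse import urlencode

# Alias name -> real index fields, taken from the example above.
ALIASES = {
    "quicksearch": ["pdfmeta"],
    "fullsearch": ["pdfmeta", "pdftext"],
}


def alias_params(aliases):
    """Translate the alias table into eDisMax 'f.<alias>.qf' parameters."""
    return {f"f.{alias}.qf": " ".join(fields) for alias, fields in aliases.items()}


def build_query(q):
    """Return a query string in which the aliases can be used as field names in q."""
    params = {"q": q, "defType": "edismax"}
    params.update(alias_params(ALIASES))
    return urlencode(params)


if __name__ == "__main__":
    print(build_query("quicksearch:invoice"))
    print(build_query("fullsearch:invoice"))
```

Because the bookmark only contains `quicksearch:value`, adding another metadata field later means editing the `ALIASES` table (or the handler defaults) and nothing else.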