
How to exclude text indexed from PDF in solr query

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields pertaining to the PDF files themselves. Still, I would like to give my users an option to exclude from the query any text indexed from within a PDF, so that the results are based on the metadata fields and not biased by the vast amount of text within the PDF files.

I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.

Is there another way?

Sounds like you are doing a general search against a default field, which means you have a lot of copyField instructions (or just one copyField * -> text) that include the PDF content field.

You can create a second destination field and copyField everything except the PDF content field into it as well. This way, users can search against one or the other combined field.
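A minimal sketch of setting up that second destination, written as Schema API commands built in Python. The field names (`title`, `author`, `keywords`, `pdf_content`, `text_meta`) are hypothetical, and the sketch assumes the `text_meta` destination field is already defined in the schema and that each command would be sent to the core's `/schema` endpoint.

```python
import json

# Hypothetical field names; substitute the ones from your own schema.
ALL_FIELDS = ["title", "author", "keywords", "pdf_content"]
PDF_CONTENT_FIELD = "pdf_content"
METADATA_DEST = "text_meta"  # assumed metadata-only catch-all field


def metadata_copy_field_commands(fields, excluded, dest):
    """Build one add-copy-field command per field, skipping the excluded one."""
    return [
        {"add-copy-field": {"source": name, "dest": [dest]}}
        for name in fields
        if name != excluded
    ]


commands = metadata_copy_field_commands(ALL_FIELDS, PDF_CONTENT_FIELD, METADATA_DEST)
for command in commands:
    # Each JSON document would be POSTed to /solr/<core>/schema.
    print(json.dumps(command))
```

The same rules can be written directly as `<copyField>` entries in the schema file; either way, a reindex is needed before the new destination field contains anything.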

However, remember that this analyzes all the copied content according to the analysis chain of the destination field, not the chains of the source fields. So, eDisMax with a list of source fields may be a better approach there. And, remember, you can define several request handlers (like '/select') and give each one different default parameters. That usually makes the client code a bit easier.

You do not need to use two separate indexes. You can use the edismax parser and specify the qf parameter at query time. That determines which fields are searched.
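As a rough illustration, the query string could be assembled like this on the client side. The field names are hypothetical, and the `include_pdf_text` flag stands in for whatever option the user interface exposes; only `defType`, `q`, and `qf` are actual Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical field names; substitute the ones from your own schema.
METADATA_FIELDS = ["title", "author", "keywords"]
PDF_CONTENT_FIELD = "pdf_content"


def build_query(user_query, include_pdf_text):
    """Return a query string that searches metadata only, or metadata plus PDF text."""
    fields = list(METADATA_FIELDS)
    if include_pdf_text:
        fields.append(PDF_CONTENT_FIELD)
    params = {
        "defType": "edismax",     # use the eDisMax query parser
        "q": user_query,
        "qf": " ".join(fields),   # space-separated list of fields to search
    }
    return urlencode(params)


# Metadata-only search: the PDF content field is left out of qf.
print(build_query("annual report", include_pdf_text=False))
```

The resulting string would be appended to the core's search handler URL, for example `/solr/<core>/select?`.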

You can also look at field aliases.

If you have two index fields

  • pdfmeta
  • pdftext

Then you can create two field aliases

  • quicksearch : pdfmeta
  • fullsearch : pdfmeta, pdftext

One advantage of using a field alias over qf is that if your users have bookmarks like q=quicksearch:value, you can change which fields the quicksearch alias points to without affecting the user's bookmark.
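A minimal sketch of those two aliases, assuming the edismax parser and its per-field `f.<alias>.qf` override. The helper names are made up for illustration; the field and alias names are the ones listed above.

```python
from urllib.parse import urlencode

# Alias name -> real index fields it expands to.
ALIASES = {
    "quicksearch": ["pdfmeta"],
    "fullsearch": ["pdfmeta", "pdftext"],
}


def alias_params(aliases):
    """Turn the alias map into edismax f.<alias>.qf parameters."""
    return {f"f.{alias}.qf": " ".join(fields) for alias, fields in aliases.items()}


def build_alias_query(alias, value):
    """Build a query string that searches through one of the aliases."""
    params = {"defType": "edismax", "q": f"{alias}:{value}"}
    params.update(alias_params(ALIASES))
    return urlencode(params)


print(build_alias_query("quicksearch", "solr"))
```

For the bookmark advantage to hold, the `f.<alias>.qf` parameters would more likely live in the request handler's defaults than in the client URL, so the mapping can be changed on the server without touching saved links.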
