
How to exclude text indexed from PDF in solr query

Original 2017-05-29 02:55:20 9 3 pdf/ indexing/ solr

I have a Solr index generated from a catalog of PDF files and corresponding metadata fields pertaining to the PDF files themselves. I would like to give my users an option to exclude from the query any text indexed from within a PDF, so that the query results are based on the metadata fields and not biased by the vast amount of text within the PDF files.

I have thought of maybe having two indexes (cores) - one with the indexed pdf files and one without.

Is there another way?

3 answers

Sounds like you are doing a general search against a default field, which means you have a lot of copyField instructions (or just one copyField * -> text) that include the PDF content field.

You can create a second destination field and copyField everything but the PDF content field into that as well. This way, users can search against one or the other combined field.
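As a rough sketch of the second-destination idea, the snippet below builds a Schema API payload that copies only metadata fields into a separate catch-all field. The field names (`title`, `author`, `keywords`, `text_meta`) are hypothetical and not from the question. The destination field is assumed to already exist in the schema, and the payload is assumed to be POSTed to the core's `/schema` endpoint.

```python
import json


def copy_field_commands(source_fields, dest):
    """Build a Schema API payload with one copyField rule per source field."""
    return {
        "add-copy-field": [
            {"source": source, "dest": dest} for source in source_fields
        ]
    }


# Hypothetical metadata fields; the PDF content field is deliberately left out.
metadata_fields = ["title", "author", "keywords"]

# "text_meta" is an assumed name for the metadata-only combined field.
payload = copy_field_commands(metadata_fields, "text_meta")

# This JSON body would be sent to <solr>/<core>/schema (assumed endpoint path).
print(json.dumps(payload, indent=2))
```

The existing catch-all field stays as it is, so a client only has to switch the default field between the two to include or exclude PDF text.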

However, remember that this parses all content according to the analysis chain of the destination field, so eDisMax with a list of source fields may be a better approach. Also remember that you can define several request handlers (like 'select'), each with different default parameters. That usually makes the client code a bit easier.
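One way to picture the separate-request-handler suggestion is a Config API payload that registers a metadata-only handler with eDisMax defaults. This is only a sketch: the handler name `/metasearch` and the field names are made up for illustration, and the payload is assumed to be POSTed to the core's `/config` endpoint.

```python
import json


def search_handler(name, qf_fields):
    """Build a Config API command for a search handler with eDisMax defaults."""
    return {
        "add-requesthandler": {
            "name": name,
            "class": "solr.SearchHandler",
            "defaults": {
                "defType": "edismax",
                # qf takes a space-separated list of fields to search.
                "qf": " ".join(qf_fields),
            },
        }
    }


# Hypothetical handler that searches metadata only (no PDF content field).
meta_handler = search_handler("/metasearch", ["title", "author", "keywords"])
print(json.dumps(meta_handler, indent=2))
```

With a handler like this, the client picks a different path instead of sending a long parameter list on every request.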

You do not need two separate indexes. You can use the edismax parser and specify the qf parameter at query time to determine which fields are searched.
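A minimal sketch of choosing `qf` at query time might look like the following. The field names `title`, `author`, and `content` are assumptions, as is the `/select` handler path. The function only builds the query string and does not contact Solr.

```python
from urllib.parse import urlencode

# Hypothetical field names; adjust to the actual schema.
METADATA_FIELDS = ["title", "author"]
PDF_TEXT_FIELD = "content"


def build_query(user_query, include_pdf_text):
    """Return a /select query string that searches metadata, optionally plus PDF text."""
    fields = list(METADATA_FIELDS)
    if include_pdf_text:
        fields.append(PDF_TEXT_FIELD)
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": " ".join(fields),  # space-separated list of fields to search
    }
    return "/select?" + urlencode(params)


print(build_query("annual report", include_pdf_text=False))
print(build_query("annual report", include_pdf_text=True))
```

The user-facing option then maps to a single boolean, and the index itself is unchanged.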

You can look at field aliases

If you have two index fields

pdfmeta
pdftext

Then you can create two field aliases

quicksearch : pdfmeta
fullsearch : pdfmeta, pdftext

One advantage of using a field alias over qf is that if your users have bookmarks like q=quicksearch:value, you can change the alias for quicksearch without affecting the user's bookmark.
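The aliases above could be expressed with eDisMax per-field `qf` parameters of the form `f.<alias>.qf`. The sketch below assumes those parameters are sent with each request (they could equally live in a request handler's defaults) and only builds the query string.

```python
from urllib.parse import urlencode

# Alias definitions from the answer: alias name -> real index fields.
ALIASES = {
    "quicksearch": ["pdfmeta"],
    "fullsearch": ["pdfmeta", "pdftext"],
}


def alias_params(aliases):
    """Translate alias definitions into eDisMax f.<alias>.qf parameters."""
    return {f"f.{alias}.qf": " ".join(fields) for alias, fields in aliases.items()}


def build_alias_query(q):
    """Build a query string in which q may reference the aliases as field names."""
    params = {"q": q, "defType": "edismax"}
    params.update(alias_params(ALIASES))
    return urlencode(params)


# A bookmarked query keeps working even if the alias mapping changes later.
print(build_alias_query("quicksearch:value"))
```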


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.


solution1
1 2017-05-29 11:16:17

solution2
0 2017-05-29 09:01:08

solution3
0 2017-05-29 15:46:44
