Solr ExtractingRequestHandler giving empty content field

Question

I'm using Solr 6.2.1 and the ExtractingRequestHandler (already included in Solr 6.2.1) to index PDF and Word documents. All documents are indexed with their metadata (title, date, cp_revision, compagny, ...), but the content field is always empty.

According to the documentation, the content field should not be empty: "Tika adds all the extracted text to the content field."

Does anybody know why the content field is empty? According to this post answer, it may be because I open my file in non-binary mode, but how do I open it in binary mode?

This is my solrconfig.xml file:

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

...

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
    <str name="capture">content</str>
    <str name="fmap.meta">attr_meta_</str>
    <str name="uprefix">attr_</str>
    <str name="lowernames">true</str>
  </lst>
</requestHandler>

Answer 1

Try indexing with the files example in examples/files, which is designed to parse rich-text formats. If that works, you can compare it with your own definition to figure out what goes wrong. I suspect the xpath parameter may be wrong and is returning empty content.

Answer 2

I was using the solr:alpine Docker image and had the same problem. It turns out the "content" field was being mapped to Solr's "text" field, which is indexed but not stored by default. See if passing "fmap.content=doc_content" in the curl request does the trick.

Answer 3

I was having a similar problem and fixed it by setting the /update/extract request handler to this:

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content</str>
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>

The key part is fmap.content, which maps the content extracted by Tika to your "content" field. That field must be defined in your schema, probably with stored="true".
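
For illustration, a stored "content" field in the schema could look roughly like the sketch below. This is an assumption rather than the poster's actual schema: the text_general field type and the multiValued setting are guesses based on common Solr example schemas, so adjust them to whatever your schema defines.

```xml
<!-- Hypothetical schema entry: the target of fmap.content.
     stored="true" is what makes the extracted text come back in query results.
     The type "text_general" and multiValued="true" are assumptions. -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

If the core uses a managed schema, the same field would normally be added through the Schema API rather than by editing the file by hand.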

3 answers

solution1
0 2016-10-20 20:51:19

solution2
0 2017-11-19 18:24:16

solution3
0 2018-12-28 09:36:34