Storing PDFs in Solr

Question

I'm trying to set things up (in my local environment) so I can store PDFs in Solr, but I cannot get it to work. Right now I'm working with the files in the example folder Solr provides.

I did not modify the solrconfig.xml in solr-3.6.0/example/conf because it already seems to be configured as described in Extracting Request Handler. That is, it already contains this:

<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

And this:

<requestHandler name="/update/extract" 
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <str name="fmap.content">text</str>
  <str name="lowernames">true</str>
  <str name="uprefix">ignored_</str>
  <str name="captureAttr">true</str>
  <str name="fmap.a">links</str>
  <str name="fmap.div">ignored_</str>
</lst>
</requestHandler>

I'm running Solr from the example directory with this command:

java -jar start.jar

And I'm trying to send the PDF to Solr with this command:

java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar /Applications/Solr-3.6.0/example/exampledocs/post.jar /path/to/pdf/mypdf.pdf

If I don't make any changes to /Solr-3.6.0/example/solr/conf/schema.xml I get the message:

FATAL: Solr returned an error #400 [doc=null] missing required field: id

If I change the value of the property "required" in the id element in schema.xml to false I get:

FATAL: Solr returned an error #400 Document is missing mandatory uniqueKey field: id

I would think that if the required property of an element is false in the schema, then I could just send files that do not contain that field, but apparently that is not the case.

I have also tried adding the parameter -Dparams=literal.id=mypdf1 to the command that sends the PDF, but that doesn't help either. Any thoughts?

Answer 1

I believe my confusion was due to the fact that you need to have an id for the document you are sending to Solr, and at the same time there is an id element in Solr-3.6.0/example/solr/conf/schema.xml.

I believe the first error I was getting referred to the id element in the schema, and the second referred to the document id. Setting required to false on the field only gets rid of the first error; as the second message says, id is also the schema's uniqueKey, so every document still needs one.

With the help of ZeroPage I was able to overcome the second error as well, by adding the document id to the URL instead of passing it as a separate parameter. This command now works for me:

java -Durl="http://localhost:8983/solr/update/extract?literal.id=form1" -jar /Applications/Solr-3.6.0/example/exampledocs/post.jar /path/to/pdf/form1.pdf
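
For anyone who would rather script this than call post.jar, here is a minimal Python sketch of the same idea: the document id goes into the query string as literal.id. The helper names build_extract_url and post_pdf are made up for illustration, and it assumes the extract handler accepts the raw PDF bytes as the request body with an application/pdf content type, which is roughly what post.jar does. I have only reasoned this through against the example setup, so treat it as a starting point.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the example server started with `java -jar start.jar`.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, extra_params=None, base_url=EXTRACT_URL):
    """Return the extract URL with literal.id (plus any extras) URL-encoded."""
    params = {"literal.id": doc_id}
    params.update(extra_params or {})
    return base_url + "?" + urlencode(params)


def post_pdf(pdf_path, doc_id, extra_params=None):
    """POST the raw PDF bytes to Solr and return the HTTP status code."""
    with open(pdf_path, "rb") as pdf:
        body = pdf.read()
    request = Request(
        build_extract_url(doc_id, extra_params),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/pdf"},
    )
    with urlopen(request) as response:
        return response.status


# Only builds and prints the URL; nothing is sent to Solr here.
print(build_extract_url("form1", {"commit": "true"}))
```

With Solr running, post_pdf("/path/to/pdf/form1.pdf", "form1", {"commit": "true"}) would send the file. Building the query string with urlencode also takes care of ids that contain spaces or other characters that are not safe in a URL.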

If we want Solr to index the full content of the PDF, we need to add the uprefix and fmap.content parameters:

java -Durl="http://localhost:8983/solr/update/extract?literal.id=form1&uprefix=attr_&fmap.content=attr_content&commit=true" -jar /Applications/Solr-3.6.0/example/exampledocs/post.jar /path/to/pdf/form1.pdf
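
To check that the document actually made it into the index after a commit, one option is to query for it by id. The sketch below only builds the query URL. It assumes the default /select handler from the example config and a simple id with no special query characters; build_id_query_url is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import urlencode

# Assumed endpoint: the default select handler of the example server.
SELECT_URL = "http://localhost:8983/solr/select"


def build_id_query_url(doc_id, base_url=SELECT_URL):
    """Return a URL that asks Solr for the document with the given id, as JSON."""
    params = {"q": "id:" + doc_id, "wt": "json"}
    return base_url + "?" + urlencode(params)


print(build_id_query_url("form1"))
```

Opening the printed URL in a browser (or fetching it with urllib.request.urlopen) should return the document if indexing worked. Which fields come back depends on which ones are stored in your schema.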
