
Importing files with Solr Cell/Tika metadata causes a multiple values error

I'm trying to index documents using Solr Cell and Tika on Solr 5.4.1. I'm using the default configuration, but when I import my docs I get this error:

multiple values encountered for non multiValued field meta: 

Here are the logs relevant to the error. They also show the data I'm sending to Solr.

125973 INFO  (qtp840863278-17) [   x:fusearchiver] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,uprefix=ignored_,captureAttr=true,fmap.a=links,fmap.div=ignored_}}} 

127134 INFO  (qtp840863278-17) [   x:fusearchiver] o.a.s.u.p.LogUpdateProcessorFactory [fusearchiver] webapp=/solr path=/update/extract params={literal.archiveDate_dt=Mon+Apr+03+21:16:48+EDT+2017&literal._accountId=2&literal.categories=taxes&literal.categories=5498&literal.id=b5701a36-0dec-4746-bb5d-3c307a557cd7&literal._batchId=25&literal._type=document&literal._filename=2016-0664-Form-5498.pdf&literal._employeeNumber=1411&wt=javabin&literal._employeeFuseId=1&literal.effectiveDate_dt=Sat+Dec+31+00:00:00+EST+2016&literal._json={"accountId":2,"archiveDate":1491268608431,"batchId":25,"categories":["taxes","5498"],"effectiveDate":1483160400000,"employeeFuseId":1,"employeeNumber":"1411","fileName":"2016-0664-Form-5498.pdf","id":"b5701a36-0dec-4746-bb5d-3c307a557cd7","imageUrl":null,"path":"2016-0664-Form-5498.pdf","uploadedBy":null,"url":null}&version=2} {} 0 1161

127135 ERROR (qtp840863278-17) [   x:fusearchiver] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=b5701a36-0dec-4746-bb5d-3c307a557cd7] multiple values encountered for non multiValued field meta: [dcterms:modified, 2017-03-16T23:14:41Z, meta:creation-date, 2017-03-16T23:14:41Z, meta:save-date, 2017-03-16T23:14:41Z, pdf:PDFVersion, 1.4, dcterms:created, 2017-03-16T23:14:41Z, Last-Modified, 2017-03-16T23:14:41Z, date, 2017-03-16T23:14:41Z, X-Parsed-By, org.apache.tika.parser.DefaultParser, X-Parsed-By, org.apache.tika.parser.pdf.PDFParser, modified, 2017-03-16T23:14:41Z, xmpTPg:NPages, 2, Creation-Date, 2017-03-16T23:14:41Z, pdf:encrypted, false, created, Thu Mar 16 23:14:41 UTC 2017, stream_size, null, dc:format, application/pdf; version=1.4, producer, Ricoh Americas Corporation, AFP2PDF, Content-Type, application/pdf, xmp:CreatorTool, Ricoh Americas Corporation, AFP2PDF Plus Version: 1.014.10, Last-Save-Date, 2017-03-16T23:14:41Z]

    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:92)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:83)
    at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:273)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:207)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:49)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:924)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1079)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:702)
    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:131)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:237)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:70)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)

Here is the extract request handler section of my solrconfig.xml:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

I thought this would mark everything that isn't a schema field as ignored, so meta shouldn't be imported. I've searched through my Solr schema and there is no meta field declared, so I expected Solr Cell to throw it out.

I'm using SolrJ to import the docs, and I'm adding a lot of literals to each document. The literals I'm providing are visible in the log output above.

Why am I seeing this error?

To work around this, can I have it extract only the content, which I would put in a text field myself, while still processing the HTML in the same manner?

The workaround was to add the following to the /update/extract request handler config in my solrconfig.xml:

<str name="fmap.meta">ignored_</str>
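
For reference, the Solr reference guide gives the Solr Cell processing order as lowernames first, then the fmap.<source>=<target> mappings, then uprefix for any name the schema still does not recognise. The following is a minimal Java sketch of the last two steps. It is a model of the documented order, not Solr's actual code. The sets of known fields are hypothetical, and the check for an ignored_* dynamic field is an assumption based on the sample configs.

```java
import java.util.Map;
import java.util.Set;

class FieldNameResolution {
    // Assumption: the schema has an ignored_* dynamic field, as the sample configs do.
    static boolean isKnown(String name, Set<String> schemaFields) {
        return schemaFields.contains(name) || name.startsWith("ignored_");
    }

    // Apply fmap.<source>=<target> first, then uprefix only if the schema
    // does not recognise the resulting name.
    static String resolve(String name, Map<String, String> fmap,
                          String uprefix, Set<String> schemaFields) {
        String mapped = fmap.getOrDefault(name, name);
        return isKnown(mapped, schemaFields) ? mapped : uprefix + mapped;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of("a", "links", "div", "ignored_");
        Map<String, String> withFmapMeta =
                Map.of("a", "links", "div", "ignored_", "meta", "ignored_");

        // Hypothetical schemas: one where "meta" is unknown, one where it resolves.
        Set<String> metaUnknown = Set.of("id", "links");
        Set<String> metaKnown = Set.of("id", "links", "meta");

        System.out.println(resolve("meta", defaults, "ignored_", metaUnknown));  // ignored_meta
        System.out.println(resolve("meta", defaults, "ignored_", metaKnown));    // meta
        System.out.println(resolve("meta", withFmapMeta, "ignored_", metaKnown)); // ignored_
    }
}
```

In this model uprefix only applies when the schema does not recognise the name, whereas an explicit fmap.meta mapping is applied regardless of what the schema contains. Whether that is what happened with this particular schema has not been verified.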

I don't know why I had to do this explicitly. I also had to set lowernames to false, because my literals were being altered and that caused serious problems for me. This convinced me that I should just run Tika outside of Solr, since that gives me more control over it. I also want to add Tesseract eventually, and that seems easier to do on your own.
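
To illustrate how lowernames can alter literals: the Solr documentation says that with lowernames=true all field names are mapped to lowercase with underscores, for example "Content-Type" becomes "content_type". A minimal Java sketch of that rule, applied to some of the field names from the request above, might look like this. It approximates the documented behaviour and is not Solr's actual code; the class and method names are made up for the example.

```java
class LowerNames {
    // Approximation of lowernames=true: letters are lowercased and any
    // character that is not a letter or digit is replaced by an underscore.
    static String lowerNames(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            sb.append(Character.isLetterOrDigit(ch) ? Character.toLowerCase(ch) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"Content-Type", "_accountId", "archiveDate_dt", "meta:creation-date"};
        for (String name : names) {
            // e.g. "_accountId" comes out as "_accountid"
            System.out.println(name + " -> " + lowerNames(name));
        }
    }
}
```

Under this rule a camel-case literal such as _accountId no longer matches a schema field declared with that exact name, which is consistent with the literals being altered.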
