
Solr Tika processor not extracting my PDF files perfectly

Hi fellow SOLR developers,

I have some PDF files that contain circuit diagrams. Some text is written vertically over the circuits. For instance, the word "junction connector" is marked vertically in the PDF over a circuit stretch, and when indexed into Solr it becomes "junction C onnector".

The search does not match the given keywords, for obvious reasons. Is it possible to change the underlying processor?

I tried converting the PDF to text using 'itextpdf' in a standalone Java class, and 'itextpdf' prints the text decently. When I read the same PDF using 'Apache Tika', I see a lot of words broken by spaces, similar to what Solr produces, which makes sense since Solr uses Tika internally.

Is it even possible to develop and integrate an 'itextpdf' entity processor, for instance, or any other custom entity processor?

My worst-case alternative is to use SolrJ to read the PDF and index it myself, but as mentioned, that is my last resort because of environment and design constraints.

Using Solr 5.3.1.

I'm using the Tika processor right now:

<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="C:\Users\12345\Downloads\workspace\Playground\circuits.pdf" format="text">
      <field column="Author" name="creator" meta="true"/>
      <field column="title" name="producer" meta="true"/>
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>

The way Solr indexes the document looks like this:

P ower Sou rc e T he ft D e te rre ntand W ire le ss D oor L ock C on tro l Turn Signal Flasher <6 –5 > DHEJ T–OV–R DJF C ombination M eter

The easiest (and not really the worst-case) alternative would be to write a small itextpdf submission module yourself that scans a directory and uses SolrJ to submit the extracted text to Solr. This will also allow easier customization and parallelization of the indexing process in the future (running the extraction and indexing process on more than one server).
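A minimal sketch of such a submission module, assuming iText 5 and SolrJ 5.x on the classpath. The Solr URL, core name ("circuits"), the "id" field, and the directory path are assumptions for illustration; the "text" field matches the one in the DIH config above.

```java
// Hypothetical sketch: extract PDF text with iText 5, then index it via SolrJ.
// Requires itextpdf and solr-solrj jars; core name and paths are assumed.
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.File;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        // Assumed core name; adjust the URL to your Solr setup.
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/circuits");

        // Scan a directory for PDFs (path assumed from the question).
        File dir = new File("C:\\Users\\12345\\Downloads\\workspace\\Playground");
        for (File pdf : dir.listFiles((d, name) -> name.endsWith(".pdf"))) {
            // Extract text page by page with iText instead of Tika.
            PdfReader reader = new PdfReader(pdf.getAbsolutePath());
            StringBuilder text = new StringBuilder();
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, page)).append('\n');
            }
            reader.close();

            // Submit the extracted text to the same field the DIH config targets.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getName());
            doc.addField("text", text.toString());
            solr.add(doc);
        }
        solr.commit();
        solr.close();
    }
}
```

Because the extraction runs in your own class, you can post-process the text (or swap extraction strategies per document type) before it ever reaches Solr, which is exactly the customization point the TikaEntityProcessor does not give you.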

The Tika extract handler will probably be moved out of Solr core and into a separate indexing tool at some point anyway.

It would also be possible to write a separate daemon that you submit documents to and that supports different indexing strategies, but no work has been done on that yet.

