
Apache Solr - Index a folder of PDF files and search on a particular page number

I am new to Apache Solr and want to understand how to index multiple PDF files in a folder.

I have installed Solr 6.6.1 on a separate server, and it is working as expected.

Please point me to an article or tutorial with steps for achieving this. I want to search for words across all the PDFs in a folder without specifying a filename. For instance, the search should run over every file in the folder, restricted to a particular page number.

For example, I want to search for the word "Partner" on page 5 of all the PDF files in the folder.

I eventually found a way after reading the documentation on the Apache Solr site, and it is easy. The simplest approach is to use the Data Import Handler. The config file is named data-config.xml:

<dataConfig>
  <!-- Reads binary files from the local filesystem -->
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- Outer entity: lists the PDF files under baseDir; it only feeds the
         inner entity, so it is not a root entity and needs no data source -->
    <entity name="K1FileEntity" processor="FileListEntityProcessor" dataSource="null"
            baseDir="C:/solr-6.6.1/server/solr/core_myfiles_Depot/Depot"
            fileName=".*\.pdf$" rootEntity="false">

      <field column="file" name="id"/>
      <field column="fileAbsolutePath" name="path"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified"/>

      <!-- Inner entity: runs Tika on each listed file and extracts metadata and plain text -->
      <entity name="pdf" processor="TikaEntityProcessor" onError="skip"
              url="${K1FileEntity.fileAbsolutePath}" format="text">

        <!-- in the original PDF, the Author meta-field name is upper-cased,
             but in the Solr schema it is lower-cased -->
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>

      </entity>
    </entity>
  </document>
</dataConfig>
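
For Solr to pick up data-config.xml, the Data Import Handler also has to be registered in the core's solrconfig.xml. The fragment below is a minimal sketch, not a tested setup. It assumes the default Solr 6.x directory layout for the `lib` paths, and the handler name `/dataimport` is the conventional choice rather than something stated above.

```xml
<!-- solrconfig.xml (fragment): a sketch assuming the default Solr 6.x install layout -->

<!-- Assumption: DIH jars live in <install>/dist; the regex is meant to cover both
     solr-dataimporthandler-*.jar and solr-dataimporthandler-extras-*.jar
     (the extras jar provides TikaEntityProcessor) -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>

<!-- Assumption: the Tika libraries live in <install>/contrib/extraction/lib -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>

<!-- Registers the handler and points it at the config file shown above -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

After reloading the core, an import can be started from the Dataimport screen of the admin UI, or by requesting the handler with `command=full-import`. The schema also needs the fields named in the config (id, path, size, lastmodified, author, title, format, text).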
