
Apache Solr - Index a folder of PDF files and search on a particular page number

I am new to Apache Solr and would like to understand how to index multiple PDF files under a folder.

Currently I have installed Solr 6.6.1 on a separate server, and it is working as expected.

Please point me to an article or tutorial with steps to achieve this. I want to search for words in all the PDFs under a folder without specifying a filename. The search should also be restricted to a particular page number across all the files in the folder.

For example, I want to search for the word "Partner" on page 5 of all the PDF files under the folder.

Finally, I found a way after reading the documentation on the Apache Solr site, and it's easy. The best and easiest way is to use the Data Import Handler. The name of the config file is data-config.xml:

<dataConfig>
  <!-- Reads binary files from the local filesystem -->
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- Lists the PDF files under baseDir; rootEntity="false" means one Solr
         document is created per row of the nested entity -->
    <entity name="K1FileEntity" processor="FileListEntityProcessor" dataSource="null"
            baseDir="C:/solr-6.6.1/server/solr/core_myfiles_Depot/Depot" fileName=".*pdf" rootEntity="false">

      <field column="file" name="id"/>
      <field column="fileAbsolutePath" name="path"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastmodified"/>

      <!-- Extracts text and metadata from each file with Tika;
           files that fail to parse are skipped -->
      <entity name="pdf" processor="TikaEntityProcessor" onError="skip"
              url="${K1FileEntity.fileAbsolutePath}" format="text">

        <!-- in the original PDF, the Author meta-field name is upper-cased,
             but in Solr schema it is lower-cased -->
        <field column="Author" name="author" meta="true"/>

        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>

      </entity>
    </entity>
  </document>
</dataConfig>
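
For data-config.xml to be picked up, the core's solrconfig.xml also has to register the Data Import Handler and load its jars. The following is a minimal sketch, not taken from the original setup: the `/dataimport` handler name and the `<lib>` directories are assumptions based on a default Solr 6.x install layout, so adjust the paths if your installation differs.

```xml
<!-- Assumption: default Solr 6.x directory layout; adjust dir if yours differs -->
<!-- Loads the DataImportHandler jars (the "extras" jar provides TikaEntityProcessor) -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar"/>
<!-- Loads Tika and its dependencies, needed to parse the PDF files -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar"/>

<!-- Assumption: the handler is exposed at /dataimport -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Resolved relative to the core's conf directory -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered this way, a full import can be started from the Dataimport screen of the Solr Admin UI, or by requesting the handler with `command=full-import`. The target fields (`path`, `size`, `lastmodified`, `author`, `title`, `format`, `text`) must also exist in the core's schema.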
