Apache Solr: indexing a folder of PDF files and searching by specific page number

Question

I am new to Apache Solr and would like to know how to index multiple PDF files that sit in a single folder.

I have installed Solr 6.6.1 on a separate server, and it is working fine.

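Please point me to articles or tutorials that describe the steps to achieve this. I want to search for words across all PDFs in the folder without specifying a file name. The search should also be limited to a specific page number across all files in the folder.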

For example, I want to search for the word "partner" on page 5 of every PDF file in that folder.

Answer 1

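After reading the documentation on the Apache Solr site, I finally found a way, and it is simple. The best and easiest approach is to use the Data Import Handler. The configuration file is named data-config.xml: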

<dataConfig>
  <dataSource type="BinFileDataSource"/> <!--Local filesystem-->
  <document>
    <entity name="K1FileEntity" processor="FileListEntityProcessor" dataSource="null"
            baseDir="C:/solr-6.6.1/server/solr/core_myfiles_Depot/Depot" fileName=".*pdf" rootEntity="false">

            <field column="file" name="id"/>
            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastmodified" />

              <entity name="pdf" processor="TikaEntityProcessor" onError="skip" 
                      url="${K1FileEntity.fileAbsolutePath}" format="text">

                <field column="Author" name="author" meta="true"/>
                <!-- in the original PDF, the Author meta-field name is upper-cased,
                  but in Solr schema it is lower-cased -->

                <field column="title" name="title" meta="true"/>
                <field column="dc:format" name="format" meta="true"/>
                <field column="text" name="text"/>

              </entity>
    </entity>
  </document>
</dataConfig>
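
For this data-config.xml to be picked up, the Data Import Handler normally has to be registered in the core's solrconfig.xml. The fragment below is a minimal sketch of that registration and is not part of the original answer: the `<lib>` paths assume the stock Solr 6.x directory layout, and the handler name `/dataimport` is only the conventional choice.

```xml
<!-- Goes inside the <config> element of the core's solrconfig.xml. -->

<!-- Assumption: default Solr 6.x install layout. Adjust the dir values
     if the distribution was unpacked or arranged differently. -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Tika and its dependencies, needed by TikaEntityProcessor -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />

<!-- Assumption: the handler is exposed at /dataimport and reads the
     data-config.xml located in the same conf directory. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

With the handler in place, an import is usually started by sending `command=full-import` to the handler URL of the core, or from the Dataimport screen of the Solr Admin UI. The target fields used in data-config.xml (id, path, size, lastmodified, author, title, format, text) also need to exist in the core's schema.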

Answer 1 (0, accepted) 2017-10-31 12:32:34
