Solr Tika data import is not working properly

When I run a data import, it indexes only one document, even though I have many files in the folder.

solrconfig.xml

<requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
        <str name="config">tika-data-config.xml</str>
    </lst>
</requestHandler>

schema.xml

    <field name="id" type="string" indexed="true" stored="true" multiValued="false" />
    <field name="fileName" type="string" indexed="true" stored="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="title" type="string" indexed="true" stored="true" />

    <field name="size" type="long" indexed="true" stored="true" />
    <field name="lastModified" type="tdate" indexed="true" stored="true" />
    <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

tika-data-config.xml

<dataConfig>
    <dataSource type="BinFileDataSource" />
    <document>
            <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="C:\Users\vellianm\Documents\BBRC\SearchEngine\solr-5.0.0\example\exampledocs\Process_documents\6.SCIM" fileName=".*\.(pdf)|(PDF)"
            onError="skip"
            recursive="true">
                <field column="fileAbsolutePath" name="id" />
                <field column="fileSize" name="size" />
                <field column="fileLastModified" name="lastModified" />
                <entity
                    name="documentImport"
                    processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}"
                    format="text">
                    <field column="file" name="fileName"/>
                    <field column="Author" name="author" meta="true"/>
                    <field column="title" name="title" meta="true"/>
                    <field column="text" name="text"/>
                </entity>
        </entity>
    </document>
</dataConfig>

When I click the data import button, I get the success message shown below.

Last Update: 15:56:02 Indexing completed. Added/Updated: 1 documents. Deleted 0 documents. Requests: 0, Fetched: 33, Skipped: 0, Processed: 1 Started: about 6 hours ago

Here 33 documents are fetched, but only one is processed. I also can't find any error in the log file.

INFO  - 2015-04-17 09:53:48.957; org.apache.solr.handler.dataimport.DataImporter; Starting Full Import
INFO  - 2015-04-17 09:53:48.959; org.apache.solr.core.SolrCore; [tika] webapp=/solr path=/dataimport params={indent=true&command=status&_=1429264428957&wt=json} status=0 QTime=0 
INFO  - 2015-04-17 09:53:48.962; org.apache.solr.handler.dataimport.SimplePropertiesWriter; Read dataimport.properties
INFO  - 2015-04-17 09:53:48.978; org.apache.solr.update.DirectUpdateHandler2; [tika] REMOVING ALL DOCUMENTS FROM INDEX
INFO  - 2015-04-17 09:53:49.124; org.apache.solr.handler.dataimport.DocBuilder; Import completed successfully

This works for me:

<dataConfig>  
<dataSource type="BinFileDataSource" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"                     
        baseDir="/tmp/docs"
        fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)"
        onError="skip"
        recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity
                name="documentImport"
                processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}"
                format="text">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
                <field column="fileAbsolutePath" name="path" />
                <field column="fileSize" name="size" />
                <field column="fileLastModified" name="lastmodified" />                    
                <field column="LastModifiedBy" name="LastModifiedBy" meta="true"/>
            </entity>
    </entity>
    </document> 
</dataConfig>

Note the baseDir: the quotes are contrary.
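A separate detail that may be worth checking in both configs is the fileName pattern. In a regular expression, alternation has the lowest precedence, so `.*\.(pdf)|(PDF)` reads as "`.*\.(pdf)` or `(PDF)`" rather than "a dot followed by pdf or PDF". The fragment below is only a sketch of the outer entity with the extensions grouped inside one pair of parentheses, assuming the intent is to pick up files ending in .pdf in either case. The baseDir value is the one from the config above and stands in for your own folder; whether this changes the number of processed documents has not been verified here.

```xml
<!-- Sketch only: outer entity with the extensions grouped in a single alternation. -->
<!-- baseDir is a placeholder taken from the config above; substitute your own folder. -->
<entity name="files" dataSource="null" rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/tmp/docs"
        fileName=".*\.(pdf|PDF)"
        onError="skip"
        recursive="true">
    <!-- field mappings and the nested TikaEntityProcessor entity stay as they were -->
</entity>
```

To cover more types, the same grouping extends naturally, for example `.*\.(pdf|PDF|doc|DOC|docx|ppt)`.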
