Indexing an XML file that contains HTML tags in Solr (with DataImportHandler)

Question

I have Solr 4.10.4 and I would like to index an XML file. Some of its XML elements contain HTML tags.

<?xml version='1.0' encoding='UTF-8' standalone='no' ?>
<root>
   <info>
        <text>
             <p>text 1</p>
             <p>text 2</p>
             <p>text 3</p> 
        </text> 
   </info> 
</root>

I used this:

<charFilter class="solr.HTMLStripCharFilterFactory"/>

but it does not work and I don't know what is wrong.

M.

Answer 1

HTMLStripCharFilterFactory strips HTML tags from the indexed (analyzed) data, not from the stored value.
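
For context, a char filter only takes effect inside the analyzer of a field type in schema.xml. The fragment below is a minimal sketch of that placement, not the asker's actual schema: the type name text_html, the tokenizer and filter choices, and the field attributes are assumptions made for illustration.

```xml
<!-- schema.xml: hypothetical field type; the name "text_html" is an assumption -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, so markup is removed from the indexed terms only -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stored="true" keeps the value exactly as it was submitted, markup included -->
<field name="text" type="text_html" indexed="true" stored="true" multiValued="true"/>
```

With a setup like this, searches match the stripped text, but the value returned in query results is still the original one.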
To strip HTML tags at import time, so that the stored value is clean as well, you can use HTMLStripTransformer in the DataImportHandler. Below is a sample DIH configuration for this.

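<dataConfig>
<dataSource name="fDS" type="FileDataSource" />
<document>
    <!-- transformer="HTMLStripTransformer" enables the transformer for this entity -->
    <entity name="tika-test" processor="XPathEntityProcessor"
            url="${solr.install.dir}/example/exampledocs/content.xml" forEach="/root" dataSource="fDS"
            transformer="HTMLStripTransformer">
            <!-- stripHTML="true" tells the transformer to process this field -->
            <field column="text" xpath="/root/info/text/p" stripHTML="true" />
    </entity>
</document>
</dataConfig>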

This transformer has one attribute, stripHTML, a boolean (true/false) set on a field to indicate whether HTMLStripTransformer should process that field.
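
As a variant, here is a sketch that assumes the whole content of <text> should end up in a single value instead of one value per <p>. It relies on the flatten attribute of XPathEntityProcessor fields, which collects the text under an element regardless of the child tag names. The file path, entity name, and data source name are carried over from the sample above, and this configuration has not been tested against 4.10.4.

```xml
<dataConfig>
  <dataSource name="fDS" type="FileDataSource" />
  <document>
    <entity name="tika-test" processor="XPathEntityProcessor"
            url="${solr.install.dir}/example/exampledocs/content.xml"
            forEach="/root" dataSource="fDS"
            transformer="HTMLStripTransformer">
      <!-- flatten="true" gathers the text of all child elements of <text> into one value -->
      <!-- stripHTML="true" additionally removes any escaped or embedded markup from that value -->
      <field column="text" xpath="/root/info/text" flatten="true" stripHTML="true" />
    </entity>
  </document>
</dataConfig>
```

The target field name "text" is assumed to exist in schema.xml. With flatten the field no longer needs to be multiValued.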

Answer 1 · 0 · 2016-09-27 12:47:20
