Solr index xml file with html tag (with DataImportHandler)

Question

I have Solr 4.10.4 and I would like to index an XML file. Some of the XML tags contain HTML tags.

<?xml version='1.0' encoding='UTF-8' standalone='no' ?>
<root>
   <info>
        <text>
             <p>text 1</p>
             <p>text 2</p>
             <p>text 3</p> 
        </text> 
   </info> 
</root>

I used this:

<charFilter class="solr.HTMLStripCharFilterFactory"/>

but it does not work and I don't know what is wrong.

M.

Answer 1

HTMLStripCharFilterFactory strips the HTML tags only from the indexed (analyzed) data, not from the stored value.
To strip HTML tags at import time, you can use HTMLStripTransformer in the DataImportHandler. Below is a sample DIH configuration for this.

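<dataConfig>
<dataSource name="fDS" type="FileDataSource" />
<document>
    <!-- transformer enables HTMLStripTransformer for fields marked stripHTML="true" -->
    <entity name="tika-test" processor="XPathEntityProcessor"
            url="${solr.install.dir}/example/exampledocs/content.xml" forEach="/root" dataSource="fDS"
            transformer="HTMLStripTransformer">
            <field column="text" xpath="/root/info/text/p" stripHTML="true" />
    </entity>
</document>
</dataConfig>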

This transformer has a single attribute, stripHTML, a boolean (true/false) set on a field to signal whether HTMLStripTransformer should process that field.
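
For the analysis-side approach from the question, the charFilter only takes effect when it sits inside the analyzer of a field type, before the tokenizer. A minimal sketch for schema.xml follows; the type name text_html, the tokenizer and filter choices, and the field attributes are assumptions for illustration, not taken from the question's actual schema.

```xml
<!-- Hypothetical field type: strips HTML markup from the character stream before tokenizing -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- char filters run first, on the raw text -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed field matching the DIH column "text"; multiValued because several <p> elements can match -->
<field name="text" type="text_html" indexed="true" stored="true" multiValued="true"/>
```

With a type like this, searches run against the stripped tokens, while the stored value returned in results is whatever was sent to the field, which is why the two approaches are often combined.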

solution1
0 2016-09-27 12:47:20