
Solr: index an XML file containing HTML tags (with DataImportHandler)

I have Solr 4.10.4 and I would like to index an XML file. Some of the XML elements contain HTML tags.

<?xml version='1.0' encoding='UTF-8' standalone='no' ?>
<root>
   <info>
        <text>
             <p>text 1</p>
             <p>text 2</p>
             <p>text 3</p> 
        </text> 
   </info> 
</root>

I used this:

<charFilter class="solr.HTMLStripCharFilterFactory"/>

but it does not work and I don't know what is wrong.

M.

HTMLStripCharFilterFactory strips the HTML tags only from the indexed (analyzed) data, not from the stored value.
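
For reference, a char filter only takes effect when it sits inside the analyzer of a field type in schema.xml, ahead of the tokenizer. The following is a minimal sketch of such a field type, not a drop-in config: the type name text_html, the choice of tokenizer and lowercase filter, and the multiValued field definition are assumptions for illustration.

```xml
<!-- schema.xml fragment; the field type name text_html is hypothetical -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- char filters run before the tokenizer, so the markup never reaches the indexed terms -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- the stored value of this field is still whatever was sent to Solr, tags included -->
<field name="text" type="text_html" indexed="true" stored="true" multiValued="true"/>
```

With a setup like this, searches match the tag-free text, but the value returned in search results is unchanged.
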
To strip the HTML tags from the data itself at import time, you can use HTMLStripTransformer in the DataImportHandler. Below is a sample DIH configuration for that.

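<dataConfig>
<dataSource name="fDS" type="FileDataSource" />
<document>
    <!-- transformer enables HTMLStripTransformer for the fields of this entity -->
    <entity name="tika-test" processor="XPathEntityProcessor"
            url="${solr.install.dir}/example/exampledocs/content.xml" forEach="/root" dataSource="fDS"
            transformer="HTMLStripTransformer">
            <!-- stripHTML="true" asks the transformer to process this field -->
            <field column="text" xpath="/root/info/text/p" stripHTML="true" />
    </entity>
</document>
</dataConfig>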

This transformer has a single attribute, stripHTML, a boolean (true/false) set on a field to signal whether HTMLStripTransformer should process that field.
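
Since stripHTML is set per field, one entity can mix stripped and untouched fields. Here is a small sketch of that, assuming a second element /root/info/title exists in the file. That element and the title column are hypothetical and not part of the sample XML in the question.

```xml
<entity name="tika-test" processor="XPathEntityProcessor"
        url="${solr.install.dir}/example/exampledocs/content.xml" forEach="/root" dataSource="fDS"
        transformer="HTMLStripTransformer">
    <!-- processed by HTMLStripTransformer -->
    <field column="text" xpath="/root/info/text/p" stripHTML="true" />
    <!-- hypothetical field: stripHTML="false" leaves it as imported -->
    <field column="title" xpath="/root/info/title" stripHTML="false" />
</entity>
```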
