Solr: exclude an HTML class from indexing

I'm indexing a knowledge base with Solr. The problem is that the menu is indexed as well, so searching for a term used in the menu returns all pages.
Can I somehow tell Solr to exclude a particular HTML class from indexing?
HTML tags are removed, so I can't find the specified element later.


EDIT:
I added a short sample for what I want to achieve.
That is, to exclude certain HTML nodes (like my navigation) from being indexed.

Sample HTML:

<nav>
    <ul>
        <li>topic-1</li>
        <li>topic-2</li>
        <li>topic-3</li>
    </ul>
</nav>
<main>
    <h1>Topic-1</h1>
    <p>Lorem ipsum dolor sit ament...</p>
</main>

What I currently get in my index from that:

topic-1
topic-2
topic-3

Topic-1
lorem ipsum dolor sit ament...

What I want to get in my index from that:

Topic-1
lorem ipsum dolor sit ament...

You basically want to remove some of the text before it is indexed. You can do that on the field itself with a PatternReplaceCharFilterFactory char filter, which runs before the tokenizer in the field type definition. Note that this only changes the indexed terms; the stored version of the field will still contain the removed text.
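
A minimal sketch of such a field type, assuming the raw HTML (tags included) actually reaches this field and that the navigation is always wrapped in a `<nav>` element as in the sample. The field type name `text_no_nav` is made up for illustration, and the regex is a rough pattern that will not cope with nested `<nav>` elements:

```xml
<fieldType name="text_no_nav" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run in the order listed, before the tokenizer. -->
    <!-- 1. Drop everything from <nav ...> up to the first </nav>, tags included.
         (?is) = case-insensitive, and "." also matches newlines.
         "<" must be written as &lt; inside an XML attribute. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?is)&lt;nav\b.*?&lt;/nav&gt;"
                replacement=" "/>
    <!-- 2. Strip the remaining HTML tags, keeping their text content. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the menu is marked by a class rather than a tag name, the pattern would have to match that class attribute instead, which gets fragile quickly with regular expressions.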

Or you could go earlier in the indexing process and use an UpdateRequestProcessor to modify the field value before it is even looked at for indexing, so the stored value is cleaned up as well. You'd want RegexReplaceProcessorFactory for that.
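
As a rough sketch, an update chain in solrconfig.xml could look like the following. The chain name `strip-nav` and the field name `content` are assumptions for illustration, and the same caveat applies: the field has to receive the raw HTML for the pattern to find anything.

```xml
<updateRequestProcessorChain name="strip-nav">
  <!-- Rewrites the field value itself, so both the indexed and the stored
       value lose the <nav>...</nav> block. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">content</str>
    <!-- Same rough pattern as a Java regex; "<" is escaped as &lt; for XML. -->
    <str name="pattern">(?is)&lt;nav\b.*?&lt;/nav&gt;</str>
    <str name="replacement"></str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then selected per request with the `update.chain=strip-nav` parameter, or configured as the default chain for the update handler.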

Use HTMLStripCharFilterFactory, which will strip HTML tags:

<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

Let me know if it works for you.

You can find more information here:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Use the XPathEntityProcessor to extract a subset of the document, matched by the provided XPath pattern.

That way you can index only the content you actually want from the page (as long as it's valid XML) and ignore common elements such as headers, footers, and navigation.
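
A hedged sketch of a DataImportHandler configuration for the sample above, assuming each page is well-formed XHTML without namespaces and laid out as `html/body/main`. The file path and the `title` and `content` field names are placeholders. XPathEntityProcessor supports only a limited subset of XPath, and DataImportHandler is no longer bundled with recent Solr releases, so check what your version offers:

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- One Solr document per <main> element; <nav> is never selected,
         so its text does not reach the index. -->
    <entity name="page"
            processor="XPathEntityProcessor"
            url="/path/to/page.xhtml"
            forEach="/html/body/main">
      <field column="title"   xpath="/html/body/main/h1"/>
      <!-- Several <p> elements produce several values, so "content"
           would need to be multiValued in the schema. -->
      <field column="content" xpath="/html/body/main/p"/>
    </entity>
  </document>
</dataConfig>
```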
