Lucene.Net: How to search through HTML entities

How do I search through HTML entities in Lucene.Net?

My entire index is stored as numeric HTML entities, so if I search for "34", for example, the match lands inside an entity: &#<b>34</b>;

I'd also like to know how to search several fields for several words, as in SQL. For example, for the search phrase "word1 word2":

SELECT * FROM table WHERE
title LIKE 'word1%' OR title LIKE 'word2%' OR
description LIKE 'word1%' OR description LIKE 'word2%'

It comes down to how you store it. When you store your document, it appears you're storing your HTML and searching on it.

I recommend that you have two separate fields:

  • One stores the raw HTML but is not analyzed (there's no need to search on the markup, is there?)
  • One contains the text extracted from the HTML for searching. This field is analyzed but not stored.

To populate the second field, run the HTML through something like HTML Agility Pack to get the inner text of the HTML nodes you're storing/processing. Then run that text through the HttpUtility.HtmlDecode method to get the text that the HTML entities represent. That decoded text is what you can actually analyze and search on.
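The extract-then-decode step can be sketched as follows. This is only an illustration of the idea in Python's standard library, not the .NET code itself: TextExtractor and searchable_text are hypothetical names, and html.parser stands in for HTML Agility Pack plus HttpUtility.HtmlDecode.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &#34; or &amp; before handing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def searchable_text(raw_html):
    """Return the decoded inner text of raw_html, ready for analysis."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse whitespace runs left behind where tags were removed.
    return " ".join("".join(parser.parts).split())


raw = "<p>&#72;&#101;&#108;&#108;&#111; &#34;world&#34; &amp; more</p>"
print(searchable_text(raw))  # Hello "world" & more
```

Once the analyzed field holds the decoded text, a search for "34" no longer matches the inside of &#34;, because the entity is no longer in the indexed text.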

Then, you can search on the analyzed field for whatever you wish without doing anything special, and then retrieve the content from the field that stores the raw HTML.

Wildcard searches are supported; you just have to build your query appropriately (assuming you are using a QueryParser). Note that leading wildcards (a term that starts with * or ?) are not enabled by default.
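As a rough sketch of what "building the query appropriately" could look like for the SQL example in the question: each LIKE 'word%' becomes a trailing-wildcard term, field:word*, in the classic query parser syntax. The helper names prefix_query and escape_term are made up for illustration, and the field names are taken from the question. The Python below only builds the query string; in Lucene.Net you would pass that string to QueryParser.Parse.

```python
import re

# Characters with special meaning in the classic Lucene query syntax.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(term):
    """Backslash-escape query syntax characters in a user-supplied word."""
    return _SPECIAL.sub(r"\\\1", term)


def prefix_query(phrase, fields):
    """Build 'field:word*' clauses for every field/word pair, joined by OR."""
    terms = [escape_term(t) for t in phrase.split()]
    clauses = [f"{field}:{term}*" for field in fields for term in terms]
    return " OR ".join(clauses)


print(prefix_query("word1 word2", ["title", "description"]))
# title:word1* OR title:word2* OR description:word1* OR description:word2*
```

Because the wildcard is always at the end of each term, this form does not need leading wildcards enabled.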
