
Lucene indexing without HTML/CSS tags in Java

I am using Lucene to index my data in Java. However, when I retrieve the terms that Lucene indexed, they include markup such as html (Lucene treats html as an ordinary term rather than a tag, so it is not removed). Is there any code or library, similar to the English analyzer, that can remove the HTML tags I choose?

If you want to remove HTML tags before the text is indexed in Lucene, you can use PatternReplaceCharFilter. It takes a regular expression and replaces every match in the input with a replacement string.

You could create the char filter like this:

CharFilter cf = new PatternReplaceCharFilter(Pattern.compile("<[^>]*>"), "", reader);

This replaces every HTML tag with an empty string, so the tags are removed before the tokenizer sees the text.
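
To see what that regular expression actually matches, here is a minimal sketch that applies the same pattern with plain java.util.regex, without Lucene on the classpath. The class name HtmlTagStripDemo and the method stripTags are made up for illustration and are not part of Lucene; the sketch only mirrors the replacement step, not the offset correction a real char filter performs.

```java
import java.util.regex.Pattern;

// Hypothetical demo class (not part of Lucene): applies the same regex
// that is passed to PatternReplaceCharFilter in the answer above.
class HtmlTagStripDemo {
    // Matches anything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replaces every match with the empty string.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>Lucene</b></p></body></html>";
        System.out.println(stripTags(html)); // prints "Hello Lucene"
    }
}
```

Note that the pattern removes only the tags themselves. Text between <style> or <script> tags is kept and would still be indexed, and character entities such as &amp; are not decoded, so real-world HTML may need additional handling.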
