Lucene indexing without HTML/CSS tags (Java)

Question

I am using Lucene from Java to index my data. When I retrieve the terms that Lucene indexed, they still include HTML markup: for example, html shows up as a term, because Lucene treats it as ordinary text rather than as a tag and does not remove it. Is there any code or library, along the lines of the English analyzer, that can remove the unwanted HTML tags?

Answer 1

If you want to remove HTML tags before the text is indexed in Lucene, you can use PatternReplaceCharFilter. It applies a regular expression to the incoming character stream and replaces every match with a given string.

You could create the char filter like this:

// reader is the java.io.Reader that supplies the raw document text
CharFilter cf = new PatternReplaceCharFilter(Pattern.compile("<[^>]*>"), "", reader);

This replaces every HTML tag with an empty string, so the tags are removed before the text reaches the tokenizer.
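To see what that regular expression actually matches, here is a minimal sketch that applies the same pattern with plain java.util.regex, without Lucene on the classpath. The class name TagStripDemo and the method stripTags are made up for illustration; only the pattern comes from the answer above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: it only demonstrates the regex.
final class TagStripDemo {
    // Same pattern as in the answer: "<", any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replaces every match of the tag pattern with an empty string.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Tags are removed, the text between them is kept.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // Only the <style> tags themselves are removed; the CSS rule text stays.
        System.out.println(stripTags("<style>p { color: red; }</style>text"));
    }
}
```

Note that the pattern removes only the tags themselves. Text between style or script tags is left in place and would still be indexed, so CSS rules may need separate handling if they appear in your documents.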
