Lucene: keeping HTML and CSS tags out of the index

Question

I am using Lucene to index my data in Java. However, when I retrieve the terms that Lucene indexed, they include markup tags such as html (html is treated as a term rather than a tag, and Lucene does not remove it). Is there any code or library, for example something like the English analyzer, that can remove the HTML tags I want removed?

Answer 1

If you want to remove HTML tags before the text is indexed in Lucene, you can use PatternReplaceCharFilter. It takes a regular expression that selects the text to replace, together with a replacement string.

You could create the char filter like this:

CharFilter cf = new PatternReplaceCharFilter(Pattern.compile("<[^>]*>"), "", reader);

This replaces every HTML tag with an empty string, so the tags are removed before the text is tokenized.
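To see what that regular expression actually matches, here is a minimal sketch that applies the same pattern with plain java.util.regex, outside Lucene. The class name TagStripDemo and the method stripTags are made up for illustration. This is not how PatternReplaceCharFilter is implemented (the char filter works on a Reader and also corrects token offsets); it only demonstrates the effect of the pattern on sample text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows the effect of the "<[^>]*>" pattern
// from the answer, using only the JDK (no Lucene classes involved).
class TagStripDemo {
    // Same pattern as in the answer: "<", then anything that is not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replace every match with the empty string, as the answer does.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Ordinary markup is removed.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // Adjacent elements are glued together, because the replacement is empty.
        System.out.println(stripTags("<p>one</p><p>two</p>"));
        // Text between a literal "<" and a later ">" is also removed.
        System.out.println(stripTags("a < b and c > d"));
    }
}
```

Two things are worth checking against your own data. With an empty replacement, words from adjacent elements can run together, so a single space as the replacement may suit indexing better. The pattern also removes any text between a literal < and a later >, even when it is not markup. Inside Lucene, a char filter like this is usually wired in by overriding Analyzer.initReader in a custom analyzer, and Lucene also ships HTMLStripCharFilter, which is aimed specifically at HTML input.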

Solution 1
0 2019-10-13 18:19:30