Adding tags to every word except non characters outside angle brackets

I'm working on a text paragraph that contains image tags and new line tags. The objective is to make every non-word character clearly visible by changing the color of all word characters to white. I'm using Java as the programming language. I'm trying to use a regular expression, but the problem is that it also changes word characters inside image tags.

String RegEx = "\\w|[àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ]";

try {
    Pattern mypattern = Pattern.compile(RegEx, Pattern.CASE_INSENSITIVE);
    Matcher myMatcher = mypattern.matcher(sentence);
    int offset = 0;
    while (myMatcher.find()) {
        int start = myMatcher.start() + offset;
        int end = myMatcher.end() + offset;
        sentence = sentence.substring(0, start) + "<font color=\"white\">" + sentence.substring(start, end) + "</font>" + sentence.substring(end, sentence.length());
        offset += 27; // 20 chars for the opening font tag plus 7 for the closing one
    }
} catch (Exception e) {
    e.printStackTrace();
}

Example of the needed result.

Input:

Most implementations<img title="hello:" alt="hello:{}" src="http://images.doctissimo.fr/hello.gif" class="wysiwyg_smiley" /> provide ASDF as a module, and you can simply (require "asdf").

Output:

<font color="white">Most</font> <font color="white">implementations</font><img title="hello:" alt="hello:{}" src="http://images.doctissimo.fr/hello.gif" class="wysiwyg_smiley" /> <font color="white">provide</font> <font color="white">ASDF</font> <font color="white">as</font> <font color="white">a</font> <font color="white">module</font>, <font color="white">and</font> <font color="white">you</font> <font color="white">can</font> <font color="white">simply</font> (<font color="white">require</font> "<font color="white">asdf</font>").

NOTE:

I hope this discussion will help the casual reader and/or googler and will be "a window of peace" in the war of Regex vs HTML Parser.


Solution #1: With Regex

Sample code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelloWorld {
    public static void main(String[] args) {
        String sentence = "Most implementations<img title=\"hello:\" alt=\"hello:{}\" src=\"http://images.doctissimo.fr/hello.gif\" class=\"wysiwyg_smiley\" /> provide ASDF as a module, and you can simply (require \"asdf\").";
        // Group 1: a run of word characters, or a run of the listed accented characters.
        // Group 2 (optional): a tag directly following the word, captured so it is copied through untouched.
        String RegEx = "(?is)(\\w+|[\u00E0\u00C0\u00E2\u00C2\u00E4\u00C4\u00E1\u00C1\u00E9\u00C9\u00E8\u00C8\u00EA\u00CA\u00EB\u00CB\u00EC\u00CC\u00EE\u00CE\u00EF\u00CF\u00F2\u00D2\u00F4\u00D4\u00F6\u00D6\u00F9\u00D9\u00FB\u00DB\u00FC\u00DC\u00E7\u00C7\u2019\u00F1]+)(<[^>]+>)?";

        Pattern mypattern = Pattern.compile(RegEx);

        Matcher myMatcher = mypattern.matcher(sentence);
        // $2 expands to an empty string when no tag follows the word.
        String output = myMatcher.replaceAll("<font color=\"white\">$1</font>$2");

        System.out.println(output);
    }
}

Output

<font color="white">Most</font> <font color="white">implementations</font><img title="hello:" alt="hello:{}" src="http://images.doctissimo.fr/hello.gif" class="wysiwyg_smiley" /> <font color="white">provide</font> <font color="white">ASDF</font> <font color="white">as</font> <font color="white">a</font> <font color="white">module</font>, <font color="white">and</font> <font color="white">you</font> <font color="white">can</font> <font color="white">simply</font> (<font color="white">require</font> "<font color="white">asdf</font>").
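
Note that the pattern in Solution #1 only protects a tag when it directly follows a word, as in `implementations<img ...>`. If a tag is preceded by a space or punctuation, its name and attributes are matched as ordinary words. A minimal sketch of one way around this, assuming attribute values never contain a literal `>`: match either a whole tag or a word run in one alternation, and copy tags through unchanged. The names `TagAwareHighlighter` and `highlight` are hypothetical and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAwareHighlighter {
    // Group 1: a complete tag such as <img ... />.
    // Group 2: a run of letters, digits, underscores or typographic apostrophes (U+2019).
    // Assumption: attribute values never contain a literal '>'.
    private static final Pattern TOKEN =
            Pattern.compile("(<[^>]*>)|([\\p{L}\\p{N}_\u2019]+)");

    public static String highlight(String html) {
        Matcher m = TOKEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Tags are copied as-is; words are wrapped in a white font tag.
            String replacement = (m.group(1) != null)
                    ? m.group(1)
                    : "<font color=\"white\">" + m.group(2) + "</font>";
            // quoteReplacement keeps '$' and '\' in the input from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("simply <img alt=\"hello:{}\" src=\"hello.gif\" /> (require \"asdf\")."));
    }
}
```

Because the tag alternative is tried first at each position, the text inside `<img ... />` is never reached by the word alternative, whatever precedes the tag.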

Solution #2: With Jsoup

Sample code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class HelloWorldWithJsoup {
    public static void main(String[] args) {
        String sentence = "Most implementations<img title=\"hello:\" alt=\"hello:{}\" src=\"http://images.doctissimo.fr/hello.gif\" class=\"wysiwyg_smiley\" /> provide ASDF as a module, and you can simply (require \"asdf\").";

        Element body = Jsoup.parse(sentence).body();

        for (TextNode textNode : body.textNodes()) {
            textNode.wrap("<font color=\"white\"></font>");
        }

        System.out.println(body.html());
    }
}

Output

<font color="white">Most implementations</font>
<img title="hello:" alt="hello:{}" src="http://images.doctissimo.fr/hello.gif" class="wysiwyg_smiley" />
<font color="white"> provide ASDF as a module, and you can simply (require &quot;asdf&quot;).</font>

Discussion

Let's compare both approaches:

Quantitatively

Imports aside, both programs have the same number of lines of code. Excluding the core classes offered by the JDK and classes instantiated under the hood, Solution #2 needs 3 additional classes (Jsoup, Element and TextNode) while Solution #1 needs 2 (Matcher, Pattern). Solution #2 requires adding a dependency to your project, while Solution #1 works out of the box with a JDK.

Qualitatively

From a readability point of view, both are straightforward. However, for a reader not seasoned in the Java regex API, the regex code may be challenging to understand. From a maintainability point of view, the regex used here is quite long and needs Unicode capabilities. The Jsoup solution relies only on well-documented methods. Finally, the output produced by Jsoup is more respectful of HTML good practices: fewer font tags are used.
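
On the maintainability point, the long list of accented characters could probably be shortened. A minimal sketch, assuming the goal is simply "any Unicode letter or digit": with the `Pattern.UNICODE_CHARACTER_CLASS` flag, `\w` also matches accented letters, so only the typographic apostrophe (U+2019) from the original list has to be added by hand. This sketch handles plain text only and does not skip tags; `UnicodeWordPattern` and `highlight` are hypothetical names.

```java
import java.util.regex.Pattern;

public class UnicodeWordPattern {
    // UNICODE_CHARACTER_CLASS makes \w follow Unicode rules, so accented letters are included.
    // The typographic apostrophe (U+2019) is punctuation and is therefore listed explicitly.
    static final Pattern WORD =
            Pattern.compile("[\\w\u2019]+", Pattern.UNICODE_CHARACTER_CLASS);

    public static String highlight(String text) {
        // $0 is the whole match, i.e. one complete word.
        return WORD.matcher(text).replaceAll("<font color=\"white\">$0</font>");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(highlight("d\u00E9j\u00E0 vu, l\u2019\u00E9t\u00E9"));
    }
}
```

A side effect of using one character class is that a word mixing plain and accented letters is wrapped once, whereas the alternation in Solution #1 matches the ASCII and accented parts of such a word as separate runs.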

Comparison matrix

Quantitatively:     |  Regex vs Jsoup
--------------------------------------
Lines of code       |    O        O
Classes used        |    O        X
Dependency required |    O        X


Qualitatively:      |  Regex vs Jsoup
--------------------------------------
Readability         |    O        O
Maintainability     |    X        O
HTML good practices |    X        O

As you can see, the battle ends in a draw.

Conclusion

IMO, in this use case, the choice between the two solutions depends largely on the output each one produces and on the output you expect. The Jsoup solution draws characters like , or ) in white; the regex approach doesn't. Whichever output the end user wants will point to one solution or the other.
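
The coarser granularity can be illustrated without Jsoup. The sketch below is not Jsoup's algorithm; it only approximates per-text-run wrapping with `java.util.regex`, assuming tags contain no literal `>` inside attribute values. `TextRunHighlighter` and `wrapTextRuns` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextRunHighlighter {
    // Group 1: a complete tag; group 2: the text run up to the next '<'.
    private static final Pattern RUN = Pattern.compile("(<[^>]*>)|([^<]+)");

    // Wraps each text run between tags as a whole, punctuation and spaces included.
    public static String wrapTextRuns(String html) {
        Matcher m = RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String piece = (m.group(1) != null)
                    ? m.group(1)
                    : "<font color=\"white\">" + m.group(2) + "</font>";
            m.appendReplacement(out, Matcher.quoteReplacement(piece));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapTextRuns("a, b<br/>c)"));
    }
}
```

Here the comma and the closing parenthesis end up inside the white font tags, which is the behaviour to avoid if non-word characters must stay visible.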
