
Java regex to match all html elements except one special case

I have a string with some markup which looks like this:

The quick brown <a href="www.fox.org">fox</a> jumped over the lazy <a href="entry://id=6000009">dog</a> <img src="dog.png" />.

I'm trying to strip away every tag except the anchor elements whose href contains "entry://id=". Thus the desired output from the above example would be:

The quick brown fox jumped over the lazy <a href="entry://id=6000009">dog</a>.

The closest I've come so far to writing this match is:

<.*?>!<a href=\"entry://id=\\d+\">.*?<\\/a>

But I can't figure out why this doesn't work. Any help (apart from the "why don't you use a parser" :) would be greatly appreciated!

I would really not use regexps for parsing HTML. HTML isn't regular and there is no end of edge cases to trip you up.

Check out JTidy instead.

Not easily possible with regex. I recommend a parser that understands the semantics of HTML/XML.

If you insist, you could take a multi-step approach, something like:

  • Replace "<(a\s*href="entry:.*?/a)>" with "{{{{\1}}}}"
  • Replace "<(?!/a}}}})[^>]*>" with ""
  • Replace "{{{{" with "<"
  • Replace "}}}}" with ">"

Be warned that the above is error-prone and will fail at some point. Consider it an ugly hack, not a real solution. Something like the above is okay for a one-off edit of some text file in a regex-aware text editor, but for repeated, real-world use as part of data processing in an app, not so much.
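For what it's worth, here is a minimal sketch of how those four steps might look in Java. The class and method names (MultiStepStrip, stripExceptEntryAnchors) are made up for illustration, and the same caveats apply: it is a hack that only handles input shaped like the example in the question.

```java
public class MultiStepStrip {

    // Hypothetical helper; each statement mirrors one step of the list above.
    static String stripExceptEntryAnchors(String html) {
        // Step 1: hide the outer < and > of the anchors to keep behind {{{{ and }}}}.
        // Java's replaceAll uses $1 (not \1) for a group reference in the replacement.
        String marked = html.replaceAll("<(a\\s*href=\"entry:.*?/a)>", "{{{{$1}}}}");

        // Step 2: remove every remaining tag. The negative lookahead skips the
        // "</a}}}}" that step 1 left behind, so the kept anchor's closing tag survives.
        String stripped = marked.replaceAll("<(?!/a}}}})[^>]*>", "");

        // Steps 3 and 4: restore the delimiters. String.replace is a literal
        // (non-regex) replacement, so the braces need no escaping here.
        return stripped.replace("{{{{", "<").replace("}}}}", ">");
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                 + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(stripExceptEntryAnchors(s));
    }
}
```

Steps 3 and 4 use String.replace rather than replaceAll because an unescaped "{" is not accepted by Java's regex compiler, whereas a literal replacement sidesteps the issue. Note that the space that preceded the removed img element is left in place.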

Using this:

((<a href="entry://id=\d+">.*?</a>)|<!\[CDATA\[.*?\]\]>|<!--.*?-->|<.*?>)

and combining it with a replaceAll whose replacement is $2 would work for your example. The code below demonstrates it:

import java.util.regex.Pattern;

import static org.junit.Assert.assertEquals;
import org.junit.Test;


public class TestStack1305864 {

    @Test
    public void matcherWithCdataAndComments(){
        String s="The quick <span>brown</span> <a href=\"www.fox.org\">fox</a> jumped over the lazy <![CDATA[ > ]]> <a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        // Removing the CDATA section leaves the space on each side of it,
        // so the expected string has two spaces after "lazy".
        String r="The quick brown fox jumped over the lazy " + " <a href=\"entry://id=6000009\">dog</a> .";
        String pattern="((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)";
        // Group 2 holds the anchor to keep; for every other match it is empty.
        String t = Pattern.compile(pattern).matcher(s).replaceAll("$2");
        System.out.println(t);
        System.out.println(r);
        assertEquals(r, t);
    }
}

The idea is to capture every element you want to keep in a specific group so that it can be inserted back into the string. A single replace-all then covers both cases:
  • For every element that is not one of the interesting ones, the group is empty and the element is replaced with "".
  • For the interesting elements, the group is not empty and its content is appended to the result string.
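To make the role of the group more visible, here is a rough sketch that does the same replacement with an explicit Matcher loop instead of replaceAll. The class and method names (GroupTwoDemo, keepEntryAnchors) are invented for this illustration; the pattern is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupTwoDemo {

    // Same pattern as above: group 2 only participates for "entry://id=" anchors.
    static final Pattern TAGS = Pattern.compile(
        "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)");

    static String keepEntryAnchors(String html) {
        Matcher m = TAGS.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // group(2) is null when another alternative matched (plain tag, CDATA, comment).
            String kept = m.group(2);
            // quoteReplacement keeps any '$' or '\' in the kept text from being
            // interpreted as replacement syntax.
            String replacement = (kept == null) ? "" : Matcher.quoteReplacement(kept);
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                 + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(keepEntryAnchors(s));
    }
}
```

The loop should behave the same as replaceAll with "$2"; it only spells out the empty-group and non-empty-group cases as separate branches.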

edit: handle nested < or > in CDATA and comments
edit: see http://martinfowler.com/bliki/ComposedRegex.html for a regex composition pattern, designed to make regexes more readable for maintenance purposes.
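As an illustration of that composition idea applied to the pattern above, one possible sketch follows. The constant and class names are assumptions made for this example, not something taken from the linked article. Because the outer group is dropped here, the anchor to keep lands in group 1 rather than group 2.

```java
import java.util.regex.Pattern;

public class ComposedTagPattern {

    // Each alternative gets a name that says what it matches.
    static final String ENTRY_ANCHOR = "<a href=\"entry://id=\\d+\">.*?</a>";
    static final String CDATA        = "<!\\[CDATA\\[.*?\\]\\]>";
    static final String COMMENT      = "<!--.*?-->";
    static final String ANY_TAG      = "<.*?>";

    // Only the anchor to keep is captured; the order matters, since the
    // more specific alternatives must be tried before ANY_TAG.
    static final Pattern STRIP = Pattern.compile(
        "(" + ENTRY_ANCHOR + ")|" + CDATA + "|" + COMMENT + "|" + ANY_TAG);

    static String strip(String html) {
        return STRIP.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("a<!-- 1 > 0 -->b <a href=\"entry://id=42\">c</a> <i>d</i>"));
    }
}
```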
