
Java regular expression to extract content between tags

Input:

<tag>Testing different formatting options in </tag><tag class="classA classB">Text</tag><tag class="classC">Class C text</tag>

Expected output:

<tag>Testing different formatting options in </tag><tagA><tagB>Text</tagA></tagB><tagC>Class C text</tag>

Basically, the tag is replaced by tags based on the values in its "class" attribute. That is, if the attribute contains classA, the tag is replaced by tagA; if classB is also present, the output also includes tagB, and so on.

Attempt made:

    final String TAG_GROUPS = "<tag class=\"(.*)\">(.*)</tag>";
    Pattern pattern = Pattern.compile(TAG_GROUPS);
    Matcher matcher = pattern.matcher(inputString);

The output I am getting fails to find the matching tags. In particular, the statement

    String classes = matcher.group(1);

gives the string classA classB">Text</tag><tag class="classC">Class C text</tag . The pattern matcher is failing to find the matching tags. I am a beginner with regular expressions and I would like to know the right pattern for this problem. Any help is appreciated.

You should use a non-greedy (reluctant) regular expression: "<tag class=\"(.*?)\">(.*?)</tag>" . Otherwise .* is greedy and matches as many characters as it can, including </tag> .
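To make the difference concrete, here is a minimal sketch that runs both patterns over the sample input from the question and collects group(1) for every match. The class name `GreedyVsReluctantDemo` and the helper `classGroups` are made up for this illustration; only `java.util.regex` from the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Sample input copied from the question.
    static final String INPUT =
        "<tag>Testing different formatting options in </tag>"
        + "<tag class=\"classA classB\">Text</tag>"
        + "<tag class=\"classC\">Class C text</tag>";

    // Hypothetical helper: returns group(1) (the class attribute value)
    // of every match of the given regex in the input.
    static List<String> classGroups(String regex, String input) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            groups.add(matcher.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Greedy: a single match whose first group runs into the following tags.
        System.out.println(classGroups("<tag class=\"(.*)\">(.*)</tag>", INPUT));
        // Reluctant: one match per tag; prints [classA classB, classC]
        System.out.println(classGroups("<tag class=\"(.*?)\">(.*?)</tag>", INPUT));
    }
}
```

With the greedy pattern the first group swallows everything up to the last `">` in the input, so the two tags are reported as one match. With the reluctant pattern each group stops at the first position where the rest of the pattern can match.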

But generally I agree with the others that parsing XML with regular expressions is not best practice. Use an XML parser instead.
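For comparison, here is a rough sketch of the parser route using the DOM parser that ships with the JDK. It assumes the input is well-formed XML (real-world HTML often is not) and wraps the fragment in a synthetic `<root>` element, because an XML document needs exactly one root. The class name `TagParserSketch` and the method `describeTags` are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagParserSketch {
    // Hypothetical helper: lists "[class value] text" for every <tag> element.
    static List<String> describeTags(String fragment) {
        try {
            // Assumption: the fragment is well-formed XML once it has a single root.
            String xml = "<root>" + fragment + "</root>";
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList tags = doc.getElementsByTagName("tag");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < tags.getLength(); i++) {
                Element tag = (Element) tags.item(i);
                // getAttribute returns an empty string when the attribute is absent.
                result.add("[" + tag.getAttribute("class") + "] " + tag.getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Malformed input ends up here instead of silently mismatching.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

The parser handles quoting, whitespace and nested elements on its own, so the class list for each tag can be read directly from the attribute rather than recovered from a capture group.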

While you could use a regexp to locate the start tags and parse the classes, there is no way to produce nested tags as output. See this answer for details.

What you could do is write your own simple HTML parser, but HTML is pretty messy to parse. Or to put it another way: have a look at my reputation, and then consider that I wouldn't try it without a really good reason (like someone paying me half a million dollars).

Use a real HTML parser like HTML Tidy instead.

When you use * it will try to absorb as many characters as possible (greedy).

If you want .* to match as few characters as possible, you must use a lazy match with *? .

So your regex becomes:

<tag class=\"(.*?)\">(.*?)</tag>

The regex above is the easy way, but it is not necessarily the optimal one. A lazy match is often slower than a greedy one, so avoid it when you can. For example, if you can assume that your markup is correct (no tag left without a closing tag, etc.), it is better to use negated character classes instead of .*? . For example, your regex can be written as:

<tag class="([^"]*)">([^<]*)</tag>

This is more efficient for the regex engine (although it is not always possible to convert a lazy match into a negated class).
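As a sketch of how the negated-class pattern could drive the actual rewrite, the code below replaces each matched tag using `Matcher.appendReplacement`. Several things here are assumptions rather than anything stated in the question: the naming rule ("classA" becomes "tagA" by swapping the "class" prefix for "tag"), closing the generated tags in reverse order so the output nests properly, and the names `ClassToTagRewriter` and `rewrite`. Because of `[^<]*`, a tag whose content contains other tags is simply left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassToTagRewriter {
    // Negated character classes: [^"]* cannot run past the closing quote,
    // and [^<]* cannot run past the start of the next tag.
    private static final Pattern TAG_WITH_CLASS =
        Pattern.compile("<tag class=\"([^\"]*)\">([^<]*)</tag>");

    static String rewrite(String input) {
        Matcher matcher = TAG_WITH_CLASS.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            StringBuilder open = new StringBuilder();
            StringBuilder close = new StringBuilder();
            for (String cls : matcher.group(1).trim().split("\\s+")) {
                if (cls.isEmpty()) {
                    continue; // empty class attribute: nothing to generate
                }
                // Assumed naming rule: "classA" -> "tagA"; other names are kept as-is.
                String name = cls.startsWith("class")
                    ? "tag" + cls.substring("class".length())
                    : cls;
                open.append('<').append(name).append('>');
                // Prepend, so closing tags appear in reverse order of the opening ones.
                close.insert(0, "</" + name + ">");
            }
            String replacement = open + matcher.group(2) + close;
            // quoteReplacement keeps '$' and '\' in the text from being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For the input in the question, `rewrite` leaves the first `<tag>` (which has no class attribute) alone and turns the other two into `<tagA><tagB>Text</tagB></tagA>` and `<tagC>Class C text</tagC>`.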

And of course, if you are trying to parse a complete HTML or XML document in which you must make many different changes, it's better to use an XML (HTML) parser.
