
Java Regex XML parsing

I'm trying to find a tag from beginning to end in XML and replace it with a blank. A sample XML looks like this:

<lins>
  <lin index="1"> ...<feature>Something</feature>... </lin>
  <lin index="2">...<feature>Something</feature>... </lin>
  <lin index="3">...<feature>Something</feature>....</lin>

  <lin index="1">...<feature>Icom</feature>... </lin>
  <lin index="2">...<feature>Icom</feature>... </lin>
</lins>

I need to remove everything from <lin> to </lin> whenever Icom appears in between.

<lin\\s(.+?Icom.+?)+</lin> removes all the lin items, because the match starts at the first opening <lin> tag and runs to the last closing </lin> tag. I would greatly appreciate a suggestion for how to do this. Also, I cannot use XML parsers in my situation.

String result = subject.replaceAll("(?s)<lin\\b(?:(?!</lin).)*Icom(?:(?!</lin).)*</lin>", "");

should do this, unless you have <lin> tags nested inside each other (or inside comments/strings).

Explanation:

<lin\b              # Match <lin (but not link or linen)
(?:                 # Match...
 (?!</lin)          # as long as we're not at a closing tag
 .                  # any character
)*                  # any number of times.
Icom                # Match Icom
(?:(?!</lin).)*     # (as above:) Match any character except closing tag
</lin>              # Match closing tag
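
To see the pattern explained above at work, here is a minimal, self-contained sketch that applies it to a shortened version of the sample from the question. The class name RemoveIcomLins and the method removeIcomLins are made up for illustration; only the regex comes from the answer.

```java
import java.util.regex.Pattern;

class RemoveIcomLins {
    // Pattern from the answer above. (?s) makes '.' match line breaks too,
    // and the (?!</lin) lookahead stops each run before a closing tag.
    private static final Pattern ICOM_LIN = Pattern.compile(
        "(?s)<lin\\b(?:(?!</lin).)*Icom(?:(?!</lin).)*</lin>");

    // Hypothetical helper: removes every non-nested <lin> element containing "Icom".
    static String removeIcomLins(String xml) {
        return ICOM_LIN.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String xml =
              "<lins>\n"
            + "  <lin index=\"1\">...<feature>Something</feature>...</lin>\n"
            + "  <lin index=\"2\">...<feature>Icom</feature>...</lin>\n"
            + "</lins>";
        System.out.println(removeIcomLins(xml));
    }
}
```

Only the matched element is removed, so the indentation and line break around it stay behind as an otherwise empty line. If that matters, the whitespace has to be handled separately.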

You can't do it with a regexp in the general case, where tags may be nested.
For this example:

<tag>
    <tag> something </tag>
</tag>

<tag>
</tag>

If you use the regexp "<tag>(.*)</tag>" (with '.' also matching line breaks), your group will be this:

    <tag> something </tag>
</tag>

<tag>

and if you use the regexp "<tag>(.*?)</tag>", your group will be this:

    <tag> something
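
The two cases can be reproduced with a small sketch like the following. The class GreedyVsLazy and the helper firstGroup are hypothetical names, and Pattern.DOTALL is assumed so that '.' crosses line breaks, which the groups shown above require.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<tag>\n    <tag> something </tag>\n</tag>\n\n<tag>\n</tag>";
        // Greedy: the group runs up to the LAST </tag> in the input.
        System.out.println("[" + firstGroup("<tag>(.*)</tag>", xml) + "]");
        // Lazy: the group stops at the FIRST </tag>, which belongs to the inner tag.
        System.out.println("[" + firstGroup("<tag>(.*?)</tag>", xml) + "]");
    }
}
```

Neither quantifier pairs the outer opening tag with its own closing tag.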

You should use something like a stack to find the closing tag that belongs to a given opening tag.
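
One way to read the stack suggestion is sketched below. This is an assumption about what is meant, not code from the answer: a regex only locates the individual opening and closing tags, and a stack of start offsets pairs them up. The names TagStackFilter and removeElements are hypothetical. The sketch ignores comments, CDATA sections and '>' inside attribute values, and it also treats hyphenated names such as <lin-x> as the tag.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStackFilter {
    // Removes every outermost <tag>...</tag> element whose text contains needle.
    static String removeElements(String xml, String tag, String needle) {
        // Matches one opening or closing tag; group 1 is "/" for a closing tag.
        Matcher m = Pattern.compile("<(/?)" + Pattern.quote(tag) + "\\b[^>]*>").matcher(xml);
        Deque<Integer> openStarts = new ArrayDeque<>(); // offsets of unclosed opening tags
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = m.group().endsWith("/>");
            if (!closing && !selfClosing) {
                openStarts.push(m.start());
            } else if (closing && !openStarts.isEmpty()) {
                int start = openStarts.pop();
                if (openStarts.isEmpty()) { // an outermost element just closed
                    String element = xml.substring(start, m.end());
                    out.append(xml, copiedUpTo, start);
                    if (!element.contains(needle)) {
                        out.append(element);
                    }
                    copiedUpTo = m.end();
                }
            }
        }
        out.append(xml, copiedUpTo, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<tag><tag> Icom </tag></tag><tag>keep</tag>";
        System.out.println(removeElements(xml, "tag", "Icom"));
    }
}
```

Because the decision is made only when the stack becomes empty, a nested element containing the search text causes its whole outermost ancestor to be dropped. Removing only the innermost element would need the check at every pop instead.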

I think you need to add more groups to the regexp.

Add a group for the precondition to start checking, for ex (

Then a group for the stuff in between, a group for Icom, etc.

So off the top of my head, my regex would look like:

(<lin\ index\=)(\w+Icom\w+)(\<\/lin>)

Note the escaping might be slightly off, but in essence you need more groups and some less greedy matchers.
