Regex XML tags having angle brackets inside

I need a regex that matches a single XML tag, e.g. <ABC/> or <ABC></ABC>.

If I use <(.)+?>, it gives me <ABC/>, <ABC> or </ABC>. This is fine.

Now, the problem:

I have an XML element like this:

<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&amp;C >1Y-5Y" SRC="BASE" DATA="data" ACTION="INSERT" ID="100000" GRC_PROD=""/>

Note that PROD_TYPE="COCOG EFI LWL P&amp;C >1Y-5Y" has a greater-than symbol inside the attribute value.

So the regex returns

<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&amp;C >

instead of the complete tag:

<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&amp;C >1Y-5Y" SRC="BASE" DATA="data" ACTION="INSERT" ID="100000" GRC_PROD=""/>

I need a regex that ignores less-than and greater-than symbols that are part of an attribute value, i.e. enclosed in double quotes.

You may try this:

(?i)<[a-z][\w:-]+(?: [a-z][\w:-]+="[^"]*")*/?>

Here is the explanation:

(?i)         # Match the remainder of the regex with the options: case insensitive (i)
<            # Match the character “<” literally
[a-z]        # Match a single character in the range between “a” and “z”
[\w:-]       # Match a single character present in the list below
                # A word character (letters, digits, and underscores)
                # The character “:”
                # The character “-”
   +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?:          # Match the regular expression below
   \            # Match the character “ ” literally
   [a-z]        # Match a single character in the range between “a” and “z”
   [\w:-]       # Match a single character present in the list below
                   # A word character (letters, digits, and underscores)
                   # The character “:”
                   # The character “-”
      +            # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   ="           # Match the characters “="” literally
   [^"]         # Match any character that is NOT a “"”
      *            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   "            # Match the character “"” literally
)*           # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
/            # Match the character “/” literally
   ?            # Between zero and one times, as many times as possible, giving back as needed (greedy)
>            # Match the character “>” literally
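As a quick check, the pattern can be run against the element from the question. The following is only a minimal sketch: the class name TagRegexDemo and the helper firstTag are made up for illustration, and the input is the sample element copied from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexDemo {
    // The pattern from the answer above, written as a Java string literal
    // (backslashes and double quotes have to be escaped).
    static final Pattern TAG = Pattern.compile(
        "(?i)<[a-z][\\w:-]+(?: [a-z][\\w:-]+=\"[^\"]*\")*/?>");

    // Hypothetical helper: returns the first tag found in the input, or null if none.
    static String firstTag(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String xml = "<VALUE ABC=\"10000\" PQR=\"12422700\" ADJ=\"\" "
            + "PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\" SRC=\"BASE\" DATA=\"data\" "
            + "ACTION=\"INSERT\" ID=\"100000\" GRC_PROD=\"\"/>";
        // The '>' inside PROD_TYPE is consumed by "[^"]*", so the whole element matches.
        System.out.println(firstTag(xml));
    }
}
```

Note the limits of this pattern: it expects exactly one space before each attribute, double-quoted values only, and a tag name of at least two characters, and it does not match closing tags such as </ABC>.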

And if you would like to match a complete element, either an opening tag with its content and closing tag or a self-closed tag, then try the regex below:

(?i)(?:<([a-z][\w:-]+)(?: [a-z][\w:-]+="[^"]*")*>.+?</\1>|<([a-z][\w:-]+)(?: [a-z][\w:-]+="[^"]*")*/>)

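A Java code fragment implementing the same:

import java.util.regex.PatternSyntaxException;

boolean foundMatch = false;
try {
    // String.matches() returns true only if the ENTIRE string matches the pattern;
    // to search inside a larger document, use Pattern.compile(...).matcher(...).find().
    foundMatch = subjectString.matches("(?i)(?:<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*>.+?</\\1>|<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*/>)");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}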

Hope this helps...

To expand on the point of G_H's link: don't use regex to parse XML. Use XPath to return a Node, and pass that Node to an identity Transformer:

Node valueElement = (Node)
    XPathFactory.newInstance().newXPath().evaluate("//VALUE",
        new InputSource(new StringReader(xmlDocument)),
        XPathConstants.NODE);

StringWriter result = new StringWriter();
TransformerFactory.newInstance().newTransformer().transform(
    new DOMSource(valueElement), new StreamResult(result));

String valueElementMarkup = result.toString();
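For anyone who wants to run this, here is a self-contained sketch of the same approach with the imports spelled out. The class name XPathExtractDemo, the helper elementMarkup and the ROWS wrapper element are assumptions made for illustration. Setting OMIT_XML_DECLARATION is an addition to the snippet above so that the result begins with the element itself rather than an XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathExtractDemo {
    // Hypothetical helper: serializes the first node selected by the XPath
    // expression, or returns null if nothing is selected.
    static String elementMarkup(String xmlDocument, String xpath) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath().evaluate(
                xpath,
                new InputSource(new StringReader(xmlDocument)),
                XPathConstants.NODE);
            if (node == null) {
                return null;
            }
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter result = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(result));
            return result.toString();
        } catch (XPathExpressionException | TransformerException ex) {
            throw new IllegalStateException("Could not extract " + xpath, ex);
        }
    }

    public static void main(String[] args) {
        String xmlDocument = "<ROWS><VALUE ABC=\"10000\" "
            + "PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\" SRC=\"BASE\"/></ROWS>";
        System.out.println(elementMarkup(xmlDocument, "//VALUE"));
    }
}
```

Because the element is parsed and then serialized again, the output is equivalent markup but not necessarily character-for-character identical to the source text; attribute order and the escaping of characters such as > may differ.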

Also try this:

<.*?(".*?".*?)*?>

It matches everything between < and >, treating double quotes as pairs: whatever sits between an opening and a closing quote is consumed as a unit. A > that appears while a quote is still open is therefore skipped, and the search continues to the next > (which should come after the closing quote).
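To see the difference on actual input, here is a small sketch that runs the pattern from the question and the quote-pair pattern side by side. The class name QuotePairDemo, the helper allTags and the shortened sample document are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotePairDemo {
    // Pattern from the question: stops at the first '>' it meets.
    static final Pattern NAIVE = Pattern.compile("<(.)+?>");
    // Quote-pair pattern: a '>' inside a "..." pair is consumed as part of the pair.
    static final Pattern QUOTE_PAIRS = Pattern.compile("<.*?(\".*?\".*?)*?>");

    // Hypothetical helper: collects every match of the pattern in the input.
    static List<String> allTags(Pattern pattern, String input) {
        List<String> tags = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String xml = "<ROW><VALUE PROD_TYPE=\"P&amp;C >1Y-5Y\" SRC=\"BASE\"/></ROW>";
        // NAIVE cuts the VALUE tag off at the '>' inside PROD_TYPE.
        System.out.println(allTags(NAIVE, xml));
        // QUOTE_PAIRS returns <ROW>, the complete VALUE tag, and </ROW>.
        System.out.println(allTags(QUOTE_PAIRS, xml));
    }
}
```

Two caveats, as far as I can tell: the dot does not match line breaks unless Pattern.DOTALL is set, so a tag spread over several lines is not matched, and because the dot also matches a double quote, input with an unbalanced quote can still produce a match after backtracking.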
