I need a regex that matches a single XML tag, e.g. <ABC/> or <ABC></ABC>.
If I use <(.)+?>, it gives me <ABC/>, <ABC> or </ABC>. This is fine.
Now the problem. I have XML like this:
<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&C >1Y-5Y" SRC="BASE" DATA="data" ACTION="INSERT" ID="100000" GRC_PROD=""/>
Here, PROD_TYPE="COCOG EFI LWL P&C >1Y-5Y" has a greater-than symbol inside the attribute value, so the regex returns
<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&C >
instead of the complete
<VALUE ABC="10000" PQR="12422700" ADJ="" PROD_TYPE="COCOG EFI LWL P&C >1Y-5Y" SRC="BASE" DATA="data" ACTION="INSERT" ID="100000" GRC_PROD=""/>
I need a regex that ignores less-than and greater-than symbols that are part of an attribute value, i.e. enclosed in double quotes.
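To make the failure concrete, here is a minimal Java sketch that runs the pattern from the question against a shortened version of the sample tag. The class and method names (NaiveTagMatchDemo, firstMatch) are made up for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveTagMatchDemo {
    // The pattern from the question: any characters, lazily, up to the first '>'.
    static final Pattern NAIVE_TAG = Pattern.compile("<(.)+?>");

    // Returns the first match of the naive pattern, or null if there is none.
    static String firstMatch(String input) {
        Matcher matcher = NAIVE_TAG.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String tag = "<VALUE ABC=\"10000\" PROD_TYPE=\"COCOG EFI LWL P&C >1Y-5Y\" SRC=\"BASE\"/>";
        // The match stops at the '>' inside the PROD_TYPE value.
        System.out.println(firstMatch(tag));
    }
}
```

The lazy quantifier stops at the first '>' it can reach, and the pattern has no notion of being inside a quoted value.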
You may try this:
(?i)<[a-z][\w:-]+(?: [a-z][\w:-]+="[^"]*")*/?>
Here is the explanation:
(?i) # Match the remainder of the regex with the options: case insensitive (i)
< # Match the character “<” literally
[a-z] # Match a single character in the range between “a” and “z”
[\w:-] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “:”
# The character “-”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Group the following tokens without capturing them (one attribute)
\  # Match the character “ ” (a space) literally; the backslash only makes the space visible here
[a-z] # Match a single character in the range between “a” and “z”
[\w:-] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “:”
# The character “-”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
=" # Match the characters “="” literally
[^"] # Match any character that is NOT a “"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
" # Match the character “"” literally
)* # Repeat the group between zero and unlimited times, as many times as possible, giving back as needed (greedy)
/ # Match the character “/” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
> # Match the character “>” literally
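As a rough check of the pattern explained above, the following Java sketch collects every match with Matcher.find() and runs it on the tag from the question. The class and method names (AttributeAwareTagDemo, findTags) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeAwareTagDemo {
    // Tag name, then zero or more space-separated name="value" attributes, then optional "/" and ">".
    static final Pattern TAG = Pattern.compile(
            "(?i)<[a-z][\\w:-]+(?: [a-z][\\w:-]+=\"[^\"]*\")*/?>");

    // Collects every substring of the input that the pattern matches.
    static List<String> findTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String sample = "<VALUE ABC=\"10000\" PQR=\"12422700\" ADJ=\"\" "
                + "PROD_TYPE=\"COCOG EFI LWL P&C >1Y-5Y\" SRC=\"BASE\" DATA=\"data\" "
                + "ACTION=\"INSERT\" ID=\"100000\" GRC_PROD=\"\"/>";
        // Because values are matched as "[^"]*", the '>' inside PROD_TYPE no longer ends the match.
        System.out.println(findTags(sample));
    }
}
```

Note that this pattern matches opening and self-closing tags only: a closing tag such as </ABC> is skipped because a letter must follow the "<". One-letter names are skipped as well, since the name needs at least two characters.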
If you would like to match complete elements, either an open tag with its close tag or a self-closed tag, then try the regex below:
(?i)(?:<([a-z][\w:-]+)(?: [a-z][\w:-]+="[^"]*")*>.*?</\1>|<([a-z][\w:-]+)(?: [a-z][\w:-]+="[^"]*")*/>)
A Java code fragment implementing the same:
boolean foundMatch = false;
try {
    // matches() succeeds only if the whole of subjectString is one element;
    // use Pattern.compile(...).matcher(subjectString).find() to search within a longer string.
    foundMatch = subjectString.matches("(?i)(?:<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*>.*?</\\1>|<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*/>)");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
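If the elements sit inside a longer string, String.matches() will not find them, because it tests the whole string. The sketch below uses Matcher.find() instead. The pattern here uses .*? between the tags so that an empty element such as <ABC></ABC> also matches. The names ElementMatchDemo and findElements are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ElementMatchDemo {
    // Alternative 1: <name attrs>...</name> (same name via \1); alternative 2: <name attrs/>.
    static final Pattern ELEMENT = Pattern.compile(
            "(?i)(?:<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*>.*?</\\1>"
            + "|<([a-z][\\w:-]+)(?: [a-z][\\w:-]+=\"[^\"]*\")*/>)");

    // Collects every element the pattern finds anywhere in the input.
    static List<String> findElements(String input) {
        List<String> elements = new ArrayList<>();
        Matcher matcher = ELEMENT.matcher(input);
        while (matcher.find()) {
            elements.add(matcher.group());
        }
        return elements;
    }

    public static void main(String[] args) {
        String text = "prefix <ABC></ABC> and <VALUE PROD_TYPE=\"P&C >1Y-5Y\" ID=\"100000\"/> suffix";
        for (String element : findElements(text)) {
            System.out.println(element);
        }
        // matches() is false here because the text is more than a single element.
        System.out.println(text.matches(ELEMENT.pattern()));
    }
}
```

Since "." does not match line terminators by default, an element whose content spans several lines would also need Pattern.DOTALL.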
Hope this helps...
To expand on the point of G_H's link: don't use regex to parse XML. Use XPath to return a Node, and pass that Node to an identity Transformer:
Node valueElement = (Node) XPathFactory.newInstance().newXPath().evaluate(
        "//VALUE",
        new InputSource(new StringReader(xmlDocument)),
        XPathConstants.NODE);
StringWriter result = new StringWriter();
TransformerFactory.newInstance().newTransformer().transform(
        new DOMSource(valueElement), new StreamResult(result));
String valueElementMarkup = result.toString();
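For completeness, here is a self-contained sketch of the same XPath plus identity Transformer approach with imports and error handling. The class and method names (XPathExtractDemo, extractFirst) and the ROOT wrapper element are assumptions for illustration. One caveat: the sample in the question contains a bare & in P&C, which is not well-formed XML, so a parser would reject it as posted. The sketch assumes the real data escapes it as &amp;.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathExtractDemo {
    // Serializes the first node selected by the XPath expression, or returns null if nothing matches.
    static String extractFirst(String xmlDocument, String xpathExpression) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath().evaluate(
                    xpathExpression,
                    new InputSource(new StringReader(xmlDocument)),
                    XPathConstants.NODE);
            if (node == null) {
                return null;
            }
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so only the element markup is returned.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter result = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(result));
            return result.toString();
        } catch (XPathExpressionException | TransformerException e) {
            throw new IllegalStateException("Could not extract " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed input: the '&' is escaped; a '>' inside an attribute value is legal XML.
        String xml = "<ROOT><VALUE ID=\"100000\" PROD_TYPE=\"COCOG EFI LWL P&amp;C >1Y-5Y\"/></ROOT>";
        System.out.println(extractFirst(xml, "//VALUE"));
    }
}
```

The serialized markup is equivalent to the source element but not necessarily byte-for-byte identical: attribute order and character escaping are up to the serializer.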
Also try this:
<.*?(".*?".*?)*?>
For well-formed tags, it matches everything from < up to a > that has an even number of double quotes (") before it within the tag. Complete pairs of quotes mean that every quoted value has been closed. A > inside a quoted value is therefore skipped, and the search continues to the next >, which should come after the closing quote.
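A quick way to try this quote-pairing pattern in Java is sketched below. The names QuoteAwareTagDemo and findTags are hypothetical, and the input is assumed to have balanced double quotes inside each tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTagDemo {
    // "<", lazy filler, then zero or more complete "..." pairs with filler after each, then ">".
    static final Pattern TAG = Pattern.compile("<.*?(\".*?\".*?)*?>");

    // Collects every tag-like match in the input.
    static List<String> findTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String sample = "<VALUE ABC=\"10000\" PQR=\"12422700\" ADJ=\"\" "
                + "PROD_TYPE=\"COCOG EFI LWL P&C >1Y-5Y\" SRC=\"BASE\" DATA=\"data\" "
                + "ACTION=\"INSERT\" ID=\"100000\" GRC_PROD=\"\"/>";
        // The '>' inside the PROD_TYPE value sits between an opening and a closing quote, so it is skipped.
        System.out.println(findTags(sample));
        // Unlike the attribute-based pattern, this one also matches closing tags.
        System.out.println(findTags("<ABC></ABC>"));
    }
}
```

This pattern makes no attempt to validate tag or attribute names, so it accepts anything between < and > as long as the quotes pair up.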