I am trying to extract both the tag name and the text between the tags from a text file. I would like to do this with a regex, since the file contains only a few XML tags.
Below is what I have tried so far:
String txt = "<DATE>December</DATE>";
String re1 = "(<[^>]+>)"; // Tag 1
String re2 = "(.*?)";     // Variable Name 1
String re3 = "(<[^>]+>)"; // Tag 2
Pattern p = Pattern.compile(re1 + re2 + re3, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
    String tag1 = m.group(1);
    String var1 = m.group(2);
    String tag2 = m.group(3);
    //System.out.print("("+tag1.toString()+")"+"("+var1.toString()+")"+"("+tag2.toString()+")"+"\n");
    System.out.println(tag1.toString().replaceAll("<>", ""));
    System.out.println(var1.toString());
}
As output, I get:
<DATE>
December
How do I get rid of the < and >?
Don't use regex to parse markup syntax, such as XML, HTML, XHTML and so on.
It is a bad idea to use regex to parse XML. In general, a regex cannot reliably match a complete element from its opening tag to its closing tag, because it cannot "remember" how many tags have been opened so far (for example, when elements are nested).
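As a rough illustration of the parser route, here is a minimal sketch using the DOM parser that ships with the JDK. It assumes the fragment is a single well-formed XML element, which may not hold for an arbitrary text file; the class and method names (ParserSketch, tagAndText) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class ParserSketch {
    // Hypothetical helper: parses one well-formed XML element and
    // returns {tagName, textContent}.
    static String[] tagAndText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return new String[] { root.getTagName(), root.getTextContent() };
        } catch (Exception e) {
            // Malformed input or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String[] result = tagAndText("<DATE>December</DATE>");
        System.out.println(result[0]); // DATE
        System.out.println(result[1]); // December
    }
}
```

The parser tracks nesting itself, so the same call keeps working if the element later gains attributes or child elements, which is where a hand-written regex tends to break.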
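However, here is why your code fails in this specific case:
In re1 and re3 the capturing group includes the < and > (also, re3 does not require the / of the closing tag), and replaceAll("<>", "") only removes the literal two-character sequence <>, which never occurs in <DATE>. You could simply change the patterns to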
String re1="<([^>]+)>"; // Tag 1
String re2="([^<]*)"; // Variable Name 1
String re3="</([^>]+)>"; // Tag 2
or use a suitable regex to remove < and > from tag1:
System.out.println(tag1.toString().replaceAll("<|>", ""));
or
System.out.println(tag1.toString().replaceAll("[<>]", ""));
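Putting both fixes together, a self-contained sketch might look like the following. The class name TagFixSketch and the helper methods extract and stripBrackets are invented for this example; the patterns are the ones suggested above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFixSketch {
    // Brackets sit outside the groups, and the closing tag must start with "</".
    private static final Pattern ELEMENT =
            Pattern.compile("<([^>]+)>([^<]*)</([^>]+)>");

    // Returns {openingTagName, text, closingTagName}, or null if nothing matches.
    static String[] extract(String txt) {
        Matcher m = ELEMENT.matcher(txt);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Alternative: keep the original pattern and strip the brackets afterwards.
    static String stripBrackets(String tag) {
        return tag.replaceAll("[<>]", "");
    }

    public static void main(String[] args) {
        String[] parts = extract("<DATE>December</DATE>");
        System.out.println(parts[0]);                      // DATE
        System.out.println(parts[1]);                      // December
        System.out.println(stripBrackets("<DATE>"));       // DATE
        System.out.println("<DATE>".replaceAll("<>", "")); // <DATE> (unchanged)
    }
}
```

The last line shows the original problem: the pattern "<>" matches only a < immediately followed by a >, so nothing in <DATE> is replaced, whereas the character class [<>] matches either bracket on its own.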