
Extracting tags and the text between tags using regex on a string with XML tags

I am trying to extract both the tag and the text between the tags in a text file. I want to do this with a regex, since the file contains only a few XML tags.

Below is what I have tried so far:

    String txt="<DATE>December</DATE>";

    String re1="(<[^>]+>)"; // Tag 1
    String re2="(.*?)"; // Variable Name 1
    String re3="(<[^>]+>)"; // Tag 2

    Pattern p = Pattern.compile(re1+re2+re3,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher m = p.matcher(txt);
    if (m.find())
    {
        String tag1=m.group(1);
        String var1=m.group(2);
        String tag2=m.group(3);
        //System.out.print("("+tag1.toString()+")"+"("+var1.toString()+")"+"("+tag2.toString()+")"+"\n");

        System.out.println(tag1.toString().replaceAll("<>", ""));
        System.out.println(var1.toString());
    }

As output, I get:

<DATE>
December

How do I get rid of the < and >?

Don't use regex to parse markup syntax, such as XML, HTML, XHTML and so on.

Many reasons are shown here.

Instead, do yourself a favor and use XPath or XQuery.
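As a rough illustration of the XPath route, here is a minimal sketch using only the JDK's built-in `javax.xml` classes. The class name `XPathTagDemo` and the helper `rootTagAndText` are made up for this example, and it assumes the input is a well-formed XML fragment with a single root element, which may not hold for an arbitrary text file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathTagDemo {
    // Hypothetical helper: returns {tag name, text content} of the root element.
    // Assumes the string is well-formed XML with exactly one root element.
    static String[] rootTagAndText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "/*" selects the document's root element, whatever its name is.
            Node root = (Node) xpath.evaluate("/*", doc, XPathConstants.NODE);
            return new String[] { root.getNodeName(), root.getTextContent() };
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String[] result = rootTagAndText("<DATE>December</DATE>");
        System.out.println(result[0]); // tag name, without angle brackets
        System.out.println(result[1]); // text between the tags
    }
}
```

Because the parser hands back the element name directly, there are no angle brackets to strip afterwards.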

It is a bad idea to use regex to parse XML. With a regex there is no reliable way to identify a complete element from its opening tag to its matching closing tag when elements can be nested, because a regex cannot "remember" how many opening tags it has already seen.
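To make the nesting problem concrete, here is a small sketch. The class name `NestedTagDemo`, the helper `firstMatch`, and the sample input are invented for illustration; the pattern is just a simple "open tag, anything, close tag" regex, not code from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTagDemo {
    // A naive "open tag, lazy content, close tag" pattern.
    private static final Pattern ELEMENT =
            Pattern.compile("<([^>]+)>(.*?)</([^>]+)>", Pattern.DOTALL);

    // Hypothetical helper: returns the first full match, or null if there is none.
    static String firstMatch(String txt) {
        Matcher m = ELEMENT.matcher(txt);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The outer <a> is paired with the FIRST </a>, which belongs to the inner element.
        System.out.println(firstMatch("<a><a>x</a></a>")); // prints <a><a>x</a>
    }
}
```

The match stops at the inner closing tag and leaves the final `</a>` behind, so the "element" it reports is not a complete one.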

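However, here is why your regex fails in this specific case:

In re1 and re3 the capturing groups include the < and > (and re3 does not account for the / of the closing tag), so tag1 comes back as <DATE>. The call replaceAll("<>", "") then removes nothing, because it looks for the two-character sequence <> rather than for either character on its own. You could simply move the brackets outside the groups: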

String re1="<([^>]+)>"; // Tag 1
String re2="([^<]*)"; // Variable Name 1
String re3="</([^>]+)>"; // Tag 2

or use a suitable regex to remove < and > from tag1:

System.out.println(tag1.toString().replaceAll("<|>", ""));

or

System.out.println(tag1.toString().replaceAll("[<>]", ""));
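Putting the first suggestion together, a complete version might look like the sketch below. The class name `TagRegexDemo` and the helper `extract` are invented for the example; the three pattern strings are the ones given above. It assumes flat, non-nested tags like the sample input, and the flags from the question are left out because `[^<]*` does not depend on `DOTALL`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexDemo {
    // Brackets sit outside the capturing groups, so the groups hold only the names and text.
    private static final String RE1 = "<([^>]+)>";   // opening tag name
    private static final String RE2 = "([^<]*)";     // text between the tags
    private static final String RE3 = "</([^>]+)>";  // closing tag name
    private static final Pattern ELEMENT = Pattern.compile(RE1 + RE2 + RE3);

    // Hypothetical helper: returns {opening tag, text, closing tag}, or null if nothing matches.
    static String[] extract(String txt) {
        Matcher m = ELEMENT.matcher(txt);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = extract("<DATE>December</DATE>");
        System.out.println(parts[0]); // prints DATE
        System.out.println(parts[1]); // prints December
    }
}
```

Note that this does not check that the closing tag name equals the opening one; it only captures both so the caller can compare them.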
