
Regex to get text from html tags (nested) - Java

Using a regex, I want to extract the text between HTML-style tags, including nested ones. HTML is used here only to illustrate the input; I am not concerned with HTML itself, only with retrieving the content enclosed by a correctly matched pair of opening and closing tags. For instance, given the following:

Input:

<h1>Text 1</h1>
<h1><h2>Text 2</h2></h1>
<h1><h2>Text 3</h2>Xtra</h1>
<h1>Text 4<h1>extra</h1515></h1>
<h1><h1></h1></h1>

Required Output:

Text 1
Text 2
Text 3
None
None

Output Obtained:

Text 1
Text 2
Text 3
Text 4<h1>extra</h1515>
<h1></h1>

Regex I tried:

"<([\\S ]+)>([\\S ]+)</\\1>"

I am not getting the expected result.

My Java code:

import java.io.*;
import java.util.*;
import java.text.*;
import java.math.*;
import java.util.regex.*;

public class Solution{
   public static void main(String[] args){

      Scanner in = new Scanner(System.in);
      int testCases = Integer.parseInt(in.nextLine());
      while(testCases>0){
         String line = in.nextLine();
         String tmp = line;
         Pattern r = Pattern.compile("<([\\S ]+)>([\\S ]+)</\\1>", Pattern.MULTILINE);
         Matcher m = r.matcher(line);
         while(m.find()){
             line = line.replaceAll(line, m.group(2));
             m = r.matcher(line);
         }
         if(line != tmp)
             System.out.println(line);
         else
             System.out.println("None");
         testCases--;
      }
   }
}

As pointed out in the comments, that way lies nothing but pain. For what you're attempting to do, you would be far better off walking the DOM (Document Object Model) with something like jsoup.
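If adding jsoup is not an option, the same idea of scanning the tags one at a time, instead of matching everything with a single backreferencing pattern, can be sketched by hand. The following is only a minimal sketch under assumptions taken from the sample input: a piece of text counts only when it sits directly between an opening tag and a closing tag with exactly the same name, and it contains no tags itself. The class name TagContentSketch and the method extract are hypothetical names used for illustration; this is not a real HTML parser and not jsoup's API.

```java
import java.util.ArrayList;
import java.util.List;

class TagContentSketch {

    // Hypothetical helper: collects the text of every <name>text</name> run
    // where the closing name equals the opening name and the text is non-empty.
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = line.indexOf('<', pos);          // start of a candidate tag
            if (open < 0) break;
            int openEnd = line.indexOf('>', open);      // end of that tag
            if (openEnd < 0) break;
            String name = line.substring(open + 1, openEnd);
            int next = line.indexOf('<', openEnd + 1);  // start of the following tag
            if (next < 0) break;
            String text = line.substring(openEnd + 1, next);
            String closing = "</" + name + ">";
            boolean isOpeningTag = !name.isEmpty() && !name.startsWith("/");
            if (isOpeningTag && !text.isEmpty() && line.startsWith(closing, next)) {
                found.add(text);
                pos = next + closing.length();          // continue after the closing tag
            } else {
                pos = next;                             // retry from the following tag
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<h1>Text 1</h1>",
            "<h1><h2>Text 2</h2></h1>",
            "<h1><h2>Text 3</h2>Xtra</h1>",
            "<h1>Text 4<h1>extra</h1515></h1>",
            "<h1><h1></h1></h1>"
        };
        for (String sample : samples) {
            List<String> texts = extract(sample);
            if (texts.isEmpty()) {
                System.out.println("None");
            } else {
                for (String text : texts) {
                    System.out.println(text);
                }
            }
        }
    }
}
```

Run against the five sample lines from the question, this prints Text 1, Text 2, Text 3, None, None. Because it compares the closing tag against the exact opening name, </h1515> is not accepted as the close of <h1>, and empty content is skipped rather than echoed back.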
