Regex to get text from html tags (nested) - Java

Question

Using regex, I want to get the text between multiple HTML tags. HTML here is only a representation of the input: I am not concerned with the tags themselves, only with the content between correctly matched opening and closing tags. For instance:

Input:

<h1>Text 1</h1>
<h1><h2>Text 2</h2></h1>
<h1><h2>Text 3</h2>Xtra</h1>
<h1>Text 4<h1>extra</h1515></h1>
<h1><h1></h1></h1>

Required Output:

Text 1
Text 2
Text 3
None
None

Output Obtained:

Text 1
Text 2
Text 3
Text 4<h1>extra</h1515>
<h1></h1>

Regex I tried:

"<([\\S ]+)>([\\S ]+)</\\1>"

I am not getting the expected result.

My java code:

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
   public static void main(String[] args) {
      // Same pattern as above: an opening tag, some content, then the matching closing tag.
      Pattern r = Pattern.compile("<([\\S ]+)>([\\S ]+)</\\1>");

      Scanner in = new Scanner(System.in);
      int testCases = Integer.parseInt(in.nextLine());
      while (testCases > 0) {
         String line = in.nextLine();
         boolean matched = false;
         // Keep unwrapping while the remaining text still contains a matching tag pair.
         Matcher m = r.matcher(line);
         while (m.find()) {
            line = m.group(2);
            matched = true;
            m = r.matcher(line);
         }
         System.out.println(matched ? line : "None");
         testCases--;
      }
   }
}

Answer 1

As pointed out in the comments, that way lies nothing but pain. For what you're attempting to do, you would be far better off walking the DOM (Document Object Model) with something like jsoup.
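
To make "walking the DOM" concrete, here is a minimal sketch. It is not jsoup: to stay dependency-free it swaps in the JDK's built-in XML parser, on the assumption that each valid input line is well-formed XML with a single root element. The class name `TagText` and the methods `extract` and `collect` are made up for illustration. The sketch treats any line the parser rejects as having no result, and collects the text of elements that contain text only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original question or answer.
class TagText {

    // Returns the text of every element that has no element children and
    // non-empty text. Returns an empty list if the line does not parse
    // (assumption: "invalid" means "not well-formed XML").
    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of logging them.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(line)));
            collect(doc.getDocumentElement(), out);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Mismatched or unclosed tags end up here; report nothing.
            out.clear();
        }
        return out;
    }

    // Depth-first walk: recurse into child elements, and record the text
    // only for elements that have no element children of their own.
    static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, out);
            }
        }
        if (!hasElementChild) {
            String text = node.getTextContent();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "<h1>Text 1</h1>",
            "<h1><h2>Text 2</h2></h1>",
            "<h1><h2>Text 3</h2>Xtra</h1>",
            "<h1>Text 4<h1>extra</h1515></h1>",
            "<h1><h1></h1></h1>"
        };
        for (String line : lines) {
            List<String> found = extract(line);
            System.out.println(found.isEmpty() ? "None" : String.join("\n", found));
        }
    }
}
```

A strict parser is used here because the question wants mismatched tags rejected. jsoup is an HTML parser built to tolerate malformed markup, so a jsoup version would probably need a separate check for the invalid cases. Content containing a raw `&` or `<` would also be rejected by this sketch.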

solution1
2 2016-01-02 23:38:42
