Using regex, I want to extract the text between multiple HTML tags. The HTML here only represents the input format; I am not concerned with the tags themselves, only with the content enclosed by a correctly matched opening and closing tag. For instance, given the following:
Input:
<h1>Text 1</h1>
<h1><h2>Text 2</h2></h1>
<h1><h2>Text 3</h2>Xtra</h1>
<h1>Text 4<h1>extra</h1515></h1>
<h1><h1></h1></h1>
Required Output:
Text 1
Text 2
Text 3
None
None
Output Obtained:
Text 1
Text 2
Text 3
Text 4<h1>extra</h1515>
<h1></h1>
Regex I tried:
"<([\\S ]+)>([\\S ]+)</\\1>"
I am not getting the expected result.
My Java code:
import java.io.*;
import java.util.*;
import java.text.*;
import java.math.*;
import java.util.regex.*;

public class Solution {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        int testCases = Integer.parseInt(in.nextLine());
        while (testCases > 0) {
            String line = in.nextLine();
            String tmp = line;
            Pattern r = Pattern.compile("<([\\S ]+)>([\\S ]+)</\\1>", Pattern.MULTILINE);
            Matcher m = r.matcher(line);
            while (m.find()) {
                line = line.replaceAll(line, m.group(2));
                m = r.matcher(line);
            }
            if (line != tmp)
                System.out.println(line);
            else
                System.out.println("None");
            testCases--;
        }
    }
}
As pointed out in the comments, that way lies nothing but pain. For what you're attempting to do, you would be far better off walking the DOM (Document Object Model) with something like jsoup.
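If adding a parser is not an option for this kind of line-by-line input, the regex can at least be narrowed. The pattern in the question uses `[\S ]+` for the content, and that class also matches `<` and `>`, so the captured group can swallow nested tags (hence `Text 4<h1>extra</h1515>`). The loop also passes the whole line to `replaceAll` as a regex and compares strings with `!=`, which compares references rather than contents. Below is a minimal sketch, not a general HTML solution: the class name `TagContentExtractor` and the method `extract` are made up for illustration, and it assumes the content itself never contains `<`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original question.
class TagContentExtractor {
    // <(.+)>   opening tag, its name captured as group 1
    // ([^<]+)  non-empty content that contains no '<', so no nested tag
    // </\1>    closing tag whose name must equal group 1
    private static final Pattern TAGGED = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Returns every piece of tagged content in the line, or an empty list.
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = TAGGED.matcher(line);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<h1>Text 1</h1>",
            "<h1><h2>Text 2</h2></h1>",
            "<h1><h2>Text 3</h2>Xtra</h1>",
            "<h1>Text 4<h1>extra</h1515></h1>",
            "<h1><h1></h1></h1>"
        };
        for (String sample : samples) {
            List<String> found = extract(sample);
            if (found.isEmpty()) {
                System.out.println("None");
            } else {
                for (String text : found) {
                    System.out.println(text);
                }
            }
        }
    }
}
```

For the five sample lines in the question this prints Text 1, Text 2, Text 3, None, None. It still treats tags as plain text, so anything beyond this simple shape (attributes, content containing `<`, real-world markup) is better handled by a parser such as jsoup.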