
Removing all the characters between two specific tags (java regex)

I want to remove everything between two specific tags, along with the tags themselves:

<DATE> html content </DATE>

These are the different versions of the code I have tried, none of them worked:

content = content.replaceAll("<DATE>(?s:)</DATE>", "");
content = content.replaceAll("<DATE>(?:.|\n)</DATE>", "");
content = content.replaceAll("<DATE>" + Pattern.DOTALL + "</DATE>", "");
content = content.replaceAll("<DATE>(.*?)</DATE>", "");

Any suggestions?

Complete Code:

Path corpusPath = Paths.get(path + file);
String content = new String(Files.readAllBytes(corpusPath), charset);
content = content.replaceAll("<HEADLINE>", "<DOCHDR>");
content = content.replaceAll("</HEADLINE>", "</DOCHDR>");
content = content.replaceAll("<DATE>(.*?)</DATE>", "");
Path destPath = Paths.get(path + "Parsed\\" +file);
Files.write(destPath, content.getBytes(charset));

Try the regex below to remove the <DATE> tag along with its content:

content = content.replaceAll("(?s)<DATE>.*?</DATE>", "");

Explanation:

  • (?s) enables DOTALL mode, which makes the dot match line terminators as well. By default the dot does not match a newline, so a pattern without this modifier fails when the tag content spans several lines.
  • <DATE> matches the opening <DATE> tag.
  • .*? matches any characters up to the next </DATE> string. The ? after * makes the quantifier lazy, so the regex engine takes the shortest possible match.
  • Finally, the matched text is replaced with an empty string.
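
For illustration, here is a minimal self-contained sketch of the same idea. The class name DateTagRemover, the method removeDateTags, and the sample input are hypothetical and not part of the original code. It uses Pattern.compile with the Pattern.DOTALL flag, which should behave the same as the inline (?s) modifier; note that Pattern.DOTALL is an int flag passed to Pattern.compile, not something to concatenate into the regex string.

```java
import java.util.regex.Pattern;

public class DateTagRemover {
    // Pattern.DOTALL is the flag form of the inline (?s) modifier.
    // Compiling once avoids recompiling the regex on every call.
    private static final Pattern DATE_BLOCK =
            Pattern.compile("<DATE>.*?</DATE>", Pattern.DOTALL);

    // Removes every <DATE>...</DATE> block, including blocks that span lines.
    static String removeDateTags(String content) {
        return DATE_BLOCK.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample input with a multi-line <DATE> block.
        String sample = "<DOC>\n<DATE>\n2014-08-01\n</DATE>\n<TEXT>body</TEXT>\n</DOC>";

        // Flag-based version.
        System.out.println(removeDateTags(sample));

        // Inline-modifier version, as in the answer above.
        System.out.println(sample.replaceAll("(?s)<DATE>.*?</DATE>", ""));
    }
}
```

Because the quantifier is lazy, each <DATE> block is removed separately, so text between two blocks is kept.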


 