I want to remove a tag together with the content it encloses, for example:
<DATE> html content </DATE>
These are the different versions of the code I have tried, none of them worked:
content = content.replaceAll("<DATE>(?s:)</DATE>", "");
content = content.replaceAll("<DATE>(?:.|\n)</DATE>", "");
content = content.replaceAll("<DATE>" + Pattern.DOTALL + "</DATE>", "");
content = content.replaceAll("<DATE>(.*?)</DATE>", "");
Any suggestions?
Complete Code:
Path corpusPath = Paths.get(path + file);
String content = new String(Files.readAllBytes(corpusPath), charset);
content = content.replaceAll("<HEADLINE>", "<DOCHDR>");
content = content.replaceAll("</HEADLINE>", "</DOCHDR>");
content = content.replaceAll("<DATE>(.*?)</DATE>", "");
Path destPath = Paths.get(path + "Parsed\\" +file);
Files.write(destPath, content.getBytes(charset));
Try the regex below to remove the <DATE> tag along with its content:
content = content.replaceAll("(?s)<DATE>.*?</DATE>", "");
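Explanation:
(?s) enables DOTALL mode, which makes the dot match newline characters as well. Without it, the dot stops at line terminators, so <DATE>(.*?)</DATE> only matches when the tags and their content sit on a single line.
<DATE> matches the opening <DATE> tag.
.*? matches any characters up to the next </DATE> string. The ? after * tells the regex engine to take the shortest possible match.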
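To illustrate the explanation above, here is a minimal, self-contained sketch. The class name DateTagStripper, the method stripDate and the sample string are made up for the demonstration and are not part of the original code. It uses Pattern.compile with the Pattern.DOTALL flag, which should behave the same as the inline (?s) modifier.

```java
import java.util.regex.Pattern;

public class DateTagStripper {
    // Hypothetical helper: precompiled equivalent of "(?s)<DATE>.*?</DATE>".
    // Pattern.DOTALL lets '.' match line terminators; ".*?" is reluctant,
    // so each match ends at the first closing </DATE> it reaches.
    private static final Pattern DATE_BLOCK =
            Pattern.compile("<DATE>.*?</DATE>", Pattern.DOTALL);

    static String stripDate(String content) {
        return DATE_BLOCK.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample: the first <DATE> block spans several lines.
        String sample = "<DOC>\n<DATE>\n2014-08-01\n</DATE>\n"
                + "<TEXT>body</TEXT>\n<DATE>x</DATE>\n</DOC>";

        // Both calls remove every <DATE>...</DATE> block, including the multi-line one.
        System.out.println(stripDate(sample));
        System.out.println(sample.replaceAll("(?s)<DATE>.*?</DATE>", ""));
    }
}
```

In the complete code from the question, this would amount to replacing the "<DATE>(.*?)</DATE>" pattern with "(?s)<DATE>.*?</DATE>" in the existing replaceAll call. Precompiling the pattern is only a convenience if many files are processed.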