Is there an existing Java library which provides a method to strip all HTML tags from a String? I'm looking for something equivalent to the strip_tags
function in PHP.
I know that I can use a regex as described in this Stack Overflow question, but I was curious whether there is already a stripTags()
method somewhere in the Apache Commons library that can be used.
Use JSoup. It is well documented and available on Maven, and after a day spent with several libraries it is the best one I have found. In my opinion, converting HTML into plain text should be possible in one line of code; otherwise the library has failed somehow. Here is the JSoup one-liner (HtmlToPlainText comes from jsoup's org.jsoup.examples package). Nothing comparable is possible in Markdown4J or Markdownj, and in HtmlCleaner it is painful, taking about 50 lines of code.
String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));
What you get is real plain text (not just the HTML source code as a String, as in other libraries); it really does a great job. The quality is more or less the same as Markdownify for PHP.
This is what I found on Google. It worked fine for me.
String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");
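A minimal sketch of the same regex wrapped in a reusable method, in case it helps. The class name RegexTagStripper is made up for illustration, and the DOTALL flag is my addition so that a tag broken across two lines is still matched. Note the limits: a tag that is cut off before its closing '>' is left untouched, and a stray '<' in ordinary text can swallow everything up to the next '>'.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class RegexTagStripper {
    // Reluctant match from '<' to the nearest '>'; DOTALL lets '.' match line breaks.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // The unclosed tag has no '>' to match, so the input is printed unchanged.
        System.out.println(stripTags("Lorem <a href=\"abc\""));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which replaceAll on a String does internally.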
Whatever you do, make sure you normalize the data before you start trying to strip tags. I recently attended a web app security workshop that covered XSS filter evasion. One would normally think that searching for < or &lt; or its hex equivalent would be sufficient. I was blown away after seeing a slide with 70 ways that < can be encoded to beat filters.
Update:
Below is the presentation I was referring to; see slide 26 for the 70 ways to encode <.
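To illustrate what "normalize first" could look like, here is a rough sketch that decodes numeric character references (decimal and hex, with or without the trailing semicolon) plus &lt; and &gt; before any stripping is done. The class and method names are invented for this example. It covers only a handful of the encodings mentioned above and is not a security filter by any means; for untrusted input a maintained sanitizer or parser is the safer route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes a few common encodings of '<' and '>' (and other characters).
final class EntityNormalizer {
    // Group 1: hex digits (&#x3c;), group 2: decimal digits (&#60;); the ';' is optional.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));?");

    static String decodeNumericRefs(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references that are not valid code points exactly as they were.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String normalize(String input) {
        // Named entities are matched case-sensitively here, which is one of the gaps of this sketch.
        return decodeNumericRefs(input).replace("&lt;", "<").replace("&gt;", ">");
    }

    public static void main(String[] args) {
        // prints: <script>
        System.out.println(normalize("&#x3C;script&#62;"));
    }
}
```

After normalizing, the result can be handed to whichever stripping approach you prefer.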
There may be some, but the most robust approach is to use an actual HTML parser. There's one here, and if the markup is reasonably well formed, you can also use SAX or another XML parser.
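For the SAX route, a small sketch using only the JDK might look like the following. The class name SaxTextExtractor is made up, and it assumes the input is well-formed XML with a single root element: an unclosed <br> or an undefined entity such as &nbsp; makes the parser throw. The doctype feature is one I believe the JDK's bundled parser supports; other JAXP implementations may reject it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the character data of a well-formed document.
final class SaxTextExtractor {
    static String extractText(String wellFormedMarkup) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Called for text between tags; element and comment events are ignored.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.newSAXParser().parse(new InputSource(new StringReader(wellFormedMarkup)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // prints: Hello world!
        System.out.println(extractText("<p>Hello <b>world</b>!</p>"));
    }
}
```

For real-world HTML, which is rarely well formed, a lenient HTML parser is the more practical choice.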
After having this question open for almost a week, I can say with some certainty that there is no method available in the Java API or Apache libraries which strips HTML tags from a String. You would either have to use an HTML parser as described in the previous answers, or write a simple regular expression to strip out the tags.
With Jsoup it's even easier than described in the answers above:
String html = "bla <b>hehe</b> <br> this is awesome simple";
String text = Jsoup.parse(html).text();
I've used nekoHtml to do that. It can strip all tags but it can just as easily keep or strip a subset of tags.
I know that this question is quite old, but I have been looking for this too, and it seems that it is still not easy to find a good, simple solution in Java.
Today I came across this little function library. It actually attempts to imitate the PHP strip_tags function.
http://jmelo.lyncode.com/java-strip_tags-php-function/
It works like this (copied from their site):
import static com.lyncode.jtwig.functions.util.HtmlUtils.stripTags;
public class StripTagsExample {
public static void main(String... args) {
String result = stripTags("<!-- <a href='test'></a>--><a>Test</a>", "");
// Produced result: Test
}
}
Hi I know this thread is old but it still came out tops on Google, and I was looking for a quick fix to the same problem. Couldn't find anything useful so I came up with this code snippet -- hope it helps someone. It just loops over the string and skips all the tags. Plain & simple.
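public static String stripTags(String inp) {
    boolean intag = false;
    StringBuilder outp = new StringBuilder();
    for (int i = 0; i < inp.length(); ++i) {
        char c = inp.charAt(i);
        if (!intag && c == '<') {
            intag = true;   // entering a tag: skip everything up to the next '>'
            continue;
        }
        if (intag && c == '>') {
            intag = false;  // leaving the tag
            continue;
        }
        if (!intag) {
            outp.append(c); // keep text that sits outside tags
        }
    }
    return outp.toString();
}
// stripTags("<H1>Some <b>HTML</b> <span style=blablabla>text</span>") returns "Some HTML text"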
With a purely iterative approach and no regex:
public String stripTags(final String html) {
final StringBuilder sbText = new StringBuilder(1000);
final StringBuilder sbHtml = new StringBuilder(1000);
boolean isText = true;
for (char ch : html.toCharArray()) {
if (isText) { // outside html
if (ch != '<') {
sbText.append(ch);
continue;
} else { // switch mode
isText = false;
sbHtml.append(ch);
continue;
}
}else { // inside html
if (ch != '>') {
sbHtml.append(ch);
continue;
} else { // switch mode
isText = true;
sbHtml.append(ch);
continue;
}
}
}
return sbText.toString();
}
Because my HTML fragments were abbreviated (the string was truncated), I also had the problem of unclosed HTML tags that a regex can't detect. For example:
Lorem ipsum dolor sit amet, <b>consectetur</b> adipiscing elit. <a href="abc"
So, of the two best answers (JSoup and regex), I preferred the solution using JSoup:
Jsoup.parse(html).text()
Wicket uses the following method to escape HTML (it escapes markup rather than removing it), located in org.apache.wicket.util.string.Strings:
public static CharSequence escapeMarkup(final String s, final boolean escapeSpaces,
final boolean convertToHtmlUnicodeEscapes)
{
if (s == null)
{
return null;
}
else
{
int len = s.length();
final AppendingStringBuffer buffer = new AppendingStringBuffer((int)(len * 1.1));
for (int i = 0; i < len; i++)
{
final char c = s.charAt(i);
switch (c)
{
case '\t' :
if (escapeSpaces)
{
// Assumption is four space tabs (sorry, but that's
// just how it is!)
buffer.append("&nbsp;&nbsp;&nbsp;&nbsp;");
}
else
{
buffer.append(c);
}
break;
case ' ' :
if (escapeSpaces)
{
buffer.append("&nbsp;");
}
else
{
buffer.append(c);
}
break;
case '<' :
buffer.append("&lt;");
break;
case '>' :
buffer.append("&gt;");
break;
case '&' :
buffer.append("&amp;");
break;
case '"' :
buffer.append("&quot;");
break;
case '\'' :
buffer.append("&#039;");
break;
default :
if (convertToHtmlUnicodeEscapes)
{
int ci = 0xffff & c;
if (ci < 160)
{
// nothing special only 7 Bit
buffer.append(c);
}
else
{
// Not 7 Bit use the unicode system
buffer.append("&#");
buffer.append(new Integer(ci).toString());
buffer.append(';');
}
}
else
{
buffer.append(c);
}
break;
}
}
return buffer;
}
}
public static String stripTags(String str) {
int startPosition = str.indexOf('<');
int endPosition;
while (startPosition != -1) {
endPosition = str.indexOf('>', startPosition);
str = str.substring(0, startPosition) + (endPosition != -1 ? str.substring(endPosition + 1) : "");
startPosition = str.indexOf('<');
}
return str;
}
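The indexOf-based method above rebuilds the whole string for every tag it removes. If that matters for long inputs, a single-pass variant with a StringBuilder is one option. This is only a sketch with the same intended behaviour (the class name IndexOfTagStripper is invented here): text between tags is copied, and an unclosed trailing tag is dropped together with everything after it.

```java
// Hypothetical helper, not part of any library.
final class IndexOfTagStripper {
    static String stripTags(String str) {
        StringBuilder out = new StringBuilder(str.length());
        int pos = 0; // start of the text not yet copied
        while (true) {
            int start = str.indexOf('<', pos);
            if (start == -1) {
                out.append(str, pos, str.length()); // no more tags: copy the tail
                break;
            }
            out.append(str, pos, start);            // copy the text before the tag
            int end = str.indexOf('>', start);
            if (end == -1) {
                break;                              // unclosed tag: discard the rest
            }
            pos = end + 1;                          // continue right after the tag
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String truncated = "Lorem ipsum dolor sit amet, <b>consectetur</b> adipiscing elit. <a href=\"abc\"";
        // prints: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        // (followed by one trailing space, because the unclosed <a ... is dropped)
        System.out.println(stripTags(truncated));
    }
}
```

Unlike the plain regex, this handles a fragment that was truncated in the middle of a tag.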