Stripping HTML tags in Java

Question

Is there an existing Java library which provides a method to strip all HTML tags from a String? I'm looking for something equivalent to the strip_tags function in PHP.

I know that I can use a regex as described in this Stack Overflow question; however, I was curious whether there is already a stripTags() method somewhere in the Apache Commons library that can be used.

Answer 1

Use JSoup. It is well documented and available on Maven, and after spending a day with several libraries it is, for me, the best one I can imagine. In my opinion, a job like this, converting HTML into plain text, should be possible in one line of code; otherwise the library has failed somehow. So here is the JSoup one-liner. Nothing like it is possible in Markdown4J or Markdownj, and in htmlCleaner it is painful, taking about 50 lines of code.

String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));

What you get is real plain text (not just the HTML source code as a String, as with other libraries); it really does a great job at that. The quality is more or less the same as Markdownify for PHP.

Answer 2

This is what I found on Google. It worked fine for me.

String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");

Answer 3

Whatever you do, make sure you normalize the data before you start trying to strip tags. I recently attended a web app security workshop that covered XSS filter evasion. One would normally think that searching for < or &lt; or its hex equivalent would be sufficient. I was blown away after seeing a slide with 70 ways that < can be encoded to beat filters.

Update:

Below is the presentation I was referring to; see slide 26 for the 70 ways to encode <.

Filter Evasion: Houdini on the Wire
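To illustrate why normalization matters, here is a minimal sketch. The class name NormalizeFirst and the helper decodeBasicEntities are hypothetical and decode only three entities; a real normalizer would have to cover far more encodings, as the presentation shows. The sketch demonstrates that an entity-encoded tag passes through a naive regex strip untouched and is removed only when decoding happens first.

```java
public class NormalizeFirst {

    // Naive tag stripper: removes anything between '<' and the next '>'.
    public static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    // Hypothetical helper: decodes only &lt;, &gt; and &amp;.
    // &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and not "<".
    public static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String input = "&lt;script&gt;alert(1)&lt;/script&gt;";

        // Stripping first finds no literal '<', so the input is unchanged.
        System.out.println(stripTags(input));

        // Decoding first exposes the tags, which are then removed.
        System.out.println(stripTags(decodeBasicEntities(input)));
    }
}
```

If the unstripped string is decoded later in the pipeline (for example by a browser), the script tag reappears, which is the filter evasion the answer warns about.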

Answer 4

There may be some, but the most robust approach is to use an actual HTML parser. There's one here, and if the HTML is reasonably well formed, you can also use SAX or another XML parser.
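As one illustration of the parser approach, the JDK ships an HTML parser in its Swing packages. The sketch below is only an assumption about how it might be used for this task; it is not the parser linked in this answer, and the class name ParserStrip is made up. It collects the text runs the parser reports and joins them with single spaces, so the original spacing is not preserved exactly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParserStrip {

    // Collects the text runs reported by the JDK's HTML parser.
    public static String stripTags(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Join text runs with a single space, since the parser
                // may drop whitespace next to tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

A dedicated library is likely to handle real-world markup better; this only shows that a parser delivers text content directly instead of requiring tag matching by hand.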

Answer 5

After having this question open for almost a week, I can say with some certainty that there is no method available in the Java API or the Apache libraries which strips HTML tags from a String. You either have to use an HTML parser as described in the previous answers, or write a simple regular expression to strip out the tags.

Answer 6

With Jsoup it's even easier than described in the answers above:

String html = "bla <b>hehe</b> <br> this is awesome simple";

String text = Jsoup.parse(html).text();

Answer 7

I've used nekoHtml to do that. It can strip all tags, but it can just as easily keep or strip a subset of tags.

Answer 8

I know that this question is quite old, but I have been looking for this too, and it seems it is still not easy to find a good and simple solution in Java.

Today I came across this little function library. It attempts to imitate the PHP strip_tags function.

http://jmelo.lyncode.com/java-strip_tags-php-function/

It works like this (copied from their site):

    import static com.lyncode.jtwig.functions.util.HtmlUtils.stripTags;

    public class StripTagsExample {
      public static void main(String... args) {
        String result = stripTags("<!-- <a href='test'></a>--><a>Test</a>", "");
        // Produced result: Test
      }
    }

Answer 9

Hi, I know this thread is old, but it still comes out on top on Google, and I was looking for a quick fix to the same problem. I couldn't find anything useful, so I came up with this code snippet; hope it helps someone. It just loops over the string and skips all the tags. Plain and simple.

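public static String stripTags(String inp)
{
    boolean intag = false;
    StringBuilder outp = new StringBuilder();

    for (int i = 0; i < inp.length(); ++i)
    {
        char c = inp.charAt(i);
        if (!intag && c == '<')
        {
            intag = true;   // entering a tag: skip until '>'
            continue;
        }
        if (intag && c == '>')
        {
            intag = false;  // leaving the tag
            continue;
        }
        if (!intag)
        {
            outp.append(c); // keep characters outside tags
        }
    }
    return outp.toString();
}

// stripTags("<H1>Some <b>HTML</b> <span style=blablabla>text</span>") returns "Some HTML text"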

Answer 10

With a purely iterative approach and no regex:

public String stripTags(final String html) {

    final StringBuilder sbText = new StringBuilder(1000);
    final StringBuilder sbHtml = new StringBuilder(1000);

    boolean isText = true;

    for (char ch : html.toCharArray()) {
        if (isText) { // outside html
            if (ch != '<') {
                sbText.append(ch);
                continue;
            } else {   // switch mode             
                isText = false;      
                sbHtml.append(ch); 
                continue;
            }
        }else { // inside html
            if (ch != '>') {
                sbHtml.append(ch);
                continue;
            } else {      // switch mode    
                isText = true;     
                sbHtml.append(ch); 
                continue;
            }
        }
    }

    return sbText.toString();
}

Answer 11

Because the HTML fragment was abbreviated (string truncation), I also had the problem of unclosed HTML tags that the regex can't detect. E.g.:

Lorem ipsum dolor sit amet, <b>consectetur</b> adipiscing elit. <a href="abc"

So, comparing the two best answers (JSoup and regex), I preferred the solution using JSoup:

Jsoup.parse(html).text()
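For readers who cannot add a dependency, the truncation problem can be reproduced with plain regexes. The following is a rough sketch, not part of the answer above: the class and method names are hypothetical, and the second pattern is one possible workaround that also treats an unclosed tag at the end of the input as removable.

```java
public class TruncatedTagDemo {

    // Removes only complete tags: '<' up to the next '>'.
    public static String stripCompleteTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    // Also removes a tag that is cut off before its closing '>',
    // by letting the match end at the end of the input instead.
    public static String stripIncludingUnclosed(String s) {
        return s.replaceAll("<[^>]*(?:>|$)", "");
    }

    public static void main(String[] args) {
        String html = "Lorem ipsum dolor sit amet, <b>consectetur</b> "
                + "adipiscing elit. <a href=\"abc\"";

        // The dangling <a href="abc" survives the first pattern.
        System.out.println(stripCompleteTags(html));

        // The second pattern drops it as well.
        System.out.println(stripIncludingUnclosed(html));
    }
}
```

Note the trade-off: the second pattern will also swallow everything after a literal < in ordinary text (such as "a < b") when no > follows, which is one reason a parser like Jsoup is the safer choice.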

Answer 12

Wicket uses the following method to escape HTML, located in org.apache.wicket.util.string.Strings:

public static CharSequence escapeMarkup(final String s, final boolean escapeSpaces,
    final boolean convertToHtmlUnicodeEscapes)
{
    if (s == null)
    {
        return null;
    }
    else
    {
        int len = s.length();
        final AppendingStringBuffer buffer = new AppendingStringBuffer((int)(len * 1.1));

        for (int i = 0; i < len; i++)
        {
            final char c = s.charAt(i);

            switch (c)
            {
                case '\t' :
                    if (escapeSpaces)
                    {
                        // Assumption is four space tabs (sorry, but that's
                        // just how it is!)
                        buffer.append("&nbsp;&nbsp;&nbsp;&nbsp;");
                    }
                    else
                    {
                        buffer.append(c);
                    }
                    break;

                case ' ' :
                    if (escapeSpaces)
                    {
                        buffer.append("&nbsp;");
                    }
                    else
                    {
                        buffer.append(c);
                    }
                    break;

                case '<' :
                    buffer.append("&lt;");
                    break;

                case '>' :
                    buffer.append("&gt;");
                    break;

                case '&' :

                    buffer.append("&amp;");
                    break;

                case '"' :
                    buffer.append("&quot;");
                    break;

                case '\'' :
                    buffer.append("&#039;");
                    break;

                default :

                    if (convertToHtmlUnicodeEscapes)
                    {
                        int ci = 0xffff & c;
                        if (ci < 160)
                        {
                            // nothing special only 7 Bit
                            buffer.append(c);
                        }
                        else
                        {
                            // Not 7 Bit use the unicode system
                            buffer.append("&#");
                            buffer.append(new Integer(ci).toString());
                            buffer.append(';');
                        }
                    }
                    else
                    {
                        buffer.append(c);
                    }

                    break;
            }
        }

        return buffer;
    }
}

Answer 13

public static String stripTags(String str) {
    int startPosition = str.indexOf('<');
    int endPosition;
    while (startPosition != -1) {
        endPosition = str.indexOf('>', startPosition);
        str = str.substring(0, startPosition) + (endPosition != -1 ? str.substring(endPosition + 1) : "");
        startPosition = str.indexOf('<');
    }
    return str;
}


13 answers

Answer 1
33 2013-07-17 15:03:57

Answer 2
29 2011-11-27 01:18:30

Answer 3
29 2009-05-07 03:29:48

Answer 4
11 2009-05-07 02:29:39

Answer 5
11 accepted 2009-05-13 17:53:59

Answer 6
7 2014-11-26 10:22:05

Answer 7
6 2009-05-07 03:03:19

Answer 8
5 2014-03-19 12:36:01

Answer 9
3 2012-08-23 00:03:56

Answer 10
3 2014-09-24 08:10:31

Answer 11
1 2017-01-23 14:28:41

Answer 12
0 2009-09-17 01:02:38

Answer 13
0 2016-01-31 13:00:50
