
Java remove HTML from String without regular expressions

I am trying to remove all HTML elements from a String. Unfortunately, I cannot use regular expressions because I am developing on the Blackberry platform and regular expressions are not yet supported there.

Is there any other way that I can remove HTML from a string? I read somewhere that you can use a DOM parser, but I couldn't find much information on it.

Text with HTML:

<![CDATA[As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (<a href="http://www.netflix.com/RoleDisplay/Billy_Bob_Thornton/20000303">Billy Bob Thornton</a>) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (<a href="http://www.netflix.com/RoleDisplay/Bruce_Willis/99786">Bruce Willis</a>) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. <a href="http://www.netflix.com/RoleDisplay/Ben_Affleck/20000016">Ben Affleck</a> and <a href="http://www.netflix.com/RoleDisplay/Liv_Tyler/162745">Liv Tyler</a> co-star.]]>

Text without HTML:

As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. Ben Affleck and Liv Tyler co-star.

Thanks!

There are a lot of nuances to parsing HTML in the wild, one of the funnier ones being that many pages out there do not follow any standard. That said, if all your HTML is going to be as simple as your example, something like this is more than enough:

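    char[] cs = s.toCharArray();
    StringBuilder sb = new StringBuilder();
    boolean tag = false; // true while we are between '<' and '>'
    for (int i = 0; i < cs.length; i++) {
        char c = cs[i];
        if (tag) {
            // Inside a tag: discard everything up to the closing '>'.
            if (c == '>') tag = false;
        } else if (c == '<') {
            tag = true;
        } else if (c == '&') {
            // interpretEscape() appends the decoded character to sb and
            // returns how many characters after the '&' it consumed.
            i += interpretEscape(cs, i, sb);
        } else {
            sb.append(c);
        }
    }
    System.err.println(sb);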

Here interpretEscape() is expected to convert HTML escapes such as &gt; to their character counterparts and to skip all characters up to the closing ;.
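For completeness, here is a minimal, self-contained sketch of the loop above together with one possible interpretEscape(). The class name HtmlStripper, the decode() helper, the 10-character lookahead, and the small entity table are assumptions made for illustration and are not part of the original answer; a real page may need a much larger entity table. It also assumes the <![CDATA[ ... ]]> wrapper from the question's sample has already been removed by whatever XML parser delivered the text, since otherwise the leading <![CDATA[ would be treated as the start of a tag.

```java
final class HtmlStripper {
    private HtmlStripper() {}

    /** Removes anything between '<' and '>' and decodes a few common entities. */
    static String strip(String s) {
        char[] cs = s.toCharArray();
        StringBuilder sb = new StringBuilder(cs.length);
        boolean tag = false; // true while we are between '<' and '>'
        for (int i = 0; i < cs.length; i++) {
            char c = cs[i];
            if (tag) {
                if (c == '>') tag = false;
            } else if (c == '<') {
                tag = true;
            } else if (c == '&') {
                i += interpretEscape(cs, i, sb);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Called with cs[i] == '&'. Appends the decoded character to sb and
     * returns the number of characters after the '&' that were consumed.
     * If no known escape is found, appends a literal '&' and returns 0.
     */
    static int interpretEscape(char[] cs, int i, StringBuilder sb) {
        int end = -1;
        // Hypothetical limit: only look a few characters ahead for the ';'.
        for (int j = i + 1; j < cs.length && j <= i + 10; j++) {
            if (cs[j] == ';') { end = j; break; }
        }
        if (end > i + 1) {
            int ch = decode(new String(cs, i + 1, end - i - 1));
            if (ch >= 0) {
                sb.append((char) ch);
                return end - i;
            }
        }
        sb.append('&');
        return 0;
    }

    /** Returns the character for an entity name, or -1 if it is not recognised. */
    private static int decode(String name) {
        if (name.equals("amp")) return '&';
        if (name.equals("lt")) return '<';
        if (name.equals("gt")) return '>';
        if (name.equals("quot")) return '"';
        if (name.equals("apos")) return '\'';
        if (name.equals("nbsp")) return ' ';
        if (name.charAt(0) == '#' && name.length() > 1) {
            try {
                boolean hex = name.charAt(1) == 'x' || name.charAt(1) == 'X';
                int v = hex ? Integer.parseInt(name.substring(2), 16)
                            : Integer.parseInt(name.substring(1), 10);
                // Only single-char (BMP) code points are handled in this sketch.
                if (v > 0 && v <= 0xFFFF) return v;
            } catch (NumberFormatException e) {
                // Not a valid numeric reference; treat as unknown.
            }
        }
        return -1;
    }
}
```

With this sketch, HtmlStripper.strip("<a href=\"x\">Bob</a> &amp; co") gives "Bob & co", and an ampersand that does not start a recognised escape (as in "AT&T") is kept literally. On Java ME profiles where StringBuilder may not be available, StringBuffer should work as a replacement here.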

I cannot use regular expressions because I am developing on the Blackberry platform

You cannot use regular expressions because HTML is a recursive language (elements nest inside one another) and regular expressions can't handle that.

You need a parser.

If you can add external jars, you can try one of these two small libs:

They both allow you to strip everything.

I have used Jericho many times. To strip tags, you define an extractor that suits your needs:

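import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.TextExtractor;

class HTMLStripExtractor extends TextExtractor
{
    public HTMLStripExtractor(Source src)
    {
        super(src);
        src.setLogger(null); // silence Jericho's parser logging
    }

    // Returning true drops the text inside the element, so this keeps
    // only top-level text and the contents of <a> elements.
    public boolean excludeElement(StartTag startTag)
    {
        return !HTMLElementName.A.equals(startTag.getName());
    }
}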

I'd try to tackle this the other way around: create a DOM tree from the HTML and then extract the string from the tree:

  • Use a library like TagSoup to parse the HTML while cleaning it up to be close to XHTML.
  • As you're streaming the cleaned-up XHTML, extract the text you want.
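The second step can be sketched with the SAX parser that ships with the standard JDK, assuming the input has already been cleaned up into well-formed XHTML. The class name SaxTextExtractor and the method extractText() are made up for this illustration. TagSoup exposes its parser through the same SAX interfaces, so a handler like this one should carry over, but that wiring is not shown here, and whether these classes are available on a given Blackberry device is a separate question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextExtractor {
    private SaxTextExtractor() {}

    /** Returns the concatenated character data of a well-formed XHTML fragment. */
    static String extractText(String xhtml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Called for text between tags; entity references such as
                // &amp; have already been decoded by the parser.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Malformed input or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return text.toString();
    }
}
```

Because the parser only reports character data to characters(), tags and attributes never reach the output, and no hand-written escape handling is needed. The input must have a single root element, which is why this approach depends on the clean-up step in the first bullet.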
