
Proper usage of JTidy to purify HTML

I am trying to use JTidy (jtidy-r938.jar) to sanitize an input HTML string, but I seem to have problems getting the default settings right. Often strings such as "hello world" end up as "helloworld" after tidying. I wanted to show what I'm doing here, and any pointers would be really appreciated:

Assume that rawHtml is the String containing the input (real world) HTML. This is what I'm doing:

        Tidy tidy = new Tidy();
        tidy.setPrintBodyOnly(true);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream(baos);

        tidy.parse(new StringReader(rawHtml), ps);
        return baos.toString("UTF8");   

First off, does anything look fundamentally wrong with the above code? I seem to be getting weird results with this.

For example, consider the following input:

<p class="MsoNormal" style="text-autospace:none;"><font color="black"><span style="color:black;">???</span></font><b><font color="#7f0055"><span style="color:#7f0055;font-weight:bold;">private</span></font></b><font color="black"><span style="color:black;"> String parseDescription</span></font><font>

The output is:

<p class="MsoNormal" style="text-autospace:none;"><font color= "black"><span style="color:black;">&nbsp;&nbsp;&nbsp;</span></font> <b><font color="#7F0055"><span style= "color:#7f0055;font-weight:bold;">private</span></font></b><font color="black"><span style="color:black;">String parseDescription</span></font></p>

So,

"private String parseDescription" becomes "privateString parseDescription"

Thanks in advance!

Have a look at how JTidy is configured:

StringWriter writer = new StringWriter();
tidy.getConfiguration().printConfigOptions(writer, true);
System.out.println(writer.toString());

Maybe it then becomes clear what causes the problem.

What is weird? Maybe a little example of the actual and the expected output?
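To make the difference concrete, here is a minimal sketch that strips the tags from shortened versions of the input and output fragments in the question and prints the visible text that remains. The class name `VisibleTextCheck`, the regex-based tag stripping, and the abbreviated fragments are assumptions made for illustration; a regex is not a general HTML parser, and this says nothing about how JTidy works internally.

```java
import java.util.regex.Pattern;

class VisibleTextCheck {
    // Naive tag matcher: good enough for this fragment, not for arbitrary HTML.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the text that is left once every tag is removed.
    static String visibleText(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened from the question's input: note the leading space in " String".
        String before = "<b><span>private</span></b>"
                + "<font color=\"black\"><span> String parseDescription</span></font>";
        // Shortened from the question's output: the leading space is gone.
        String after = "<b><span>private</span></b>"
                + "<font color=\"black\"><span>String parseDescription</span></font>";

        System.out.println(visibleText(before)); // private String parseDescription
        System.out.println(visibleText(after));  // privateString parseDescription
    }
}
```

Comparing the two strings before and after tidying gives a quick way to spot fragments where whitespace between words was dropped.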

Well, this seems to be a bug in JTidy. For the exact file which causes problems, refer here:

http://sourceforge.net/tracker/?func=detail&aid=2985849&group_id=13153&atid=113153

Thanks for all the help, folks!

Here is how we are calling JTidy from Ant. You may infer the API call from it:

<tidy destdir="${build.dir.result}">
  <fileset dir="${src}" includes="**/*.htm"/>
  <parameter name="tidy-mark" value="false"/>
  <parameter name="output-xml" value="no"/>
  <parameter name="numeric-entities" value="yes"/>
  <parameter name="indent-spaces" value="2"/>
  <parameter name="indent-attributes" value="no"/>
  <parameter name="markup" value="yes"/>
  <parameter name="wrap" value="2000"/>
  <parameter name="uppercase-tags" value="no"/>
  <parameter name="uppercase-attributes" value="no"/>
  <parameter name="quiet" value="no"/>
  <parameter name="clean" value="yes"/>
  <parameter name="show-warnings" value="yes"/>
  <parameter name="break-before-br" value="yes"/>
  <parameter name="hide-comments" value="yes"/>
  <parameter name="char-encoding" value="latin1"/>
  <parameter name="output-html" value="yes"/>
</tidy>
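One way to carry these settings over to Java is to collect the same option names in a `java.util.Properties` object. The sketch below only builds that object from a subset of the parameters above. It assumes, without confirmation from this thread, that the Ant parameter names are JTidy configuration option names and that the object would then be passed to `Tidy.setConfigurationFromProps(Properties)`. That call appears only in a comment so the block compiles without JTidy on the classpath. The class and method names are hypothetical.

```java
import java.util.Properties;

class TidyOptionsSketch {
    // Collects a subset of the Ant <parameter> entries as name/value pairs.
    static Properties fromAntParameters() {
        Properties props = new Properties();
        props.setProperty("tidy-mark", "false");
        props.setProperty("output-xml", "no");
        props.setProperty("numeric-entities", "yes");
        props.setProperty("indent-spaces", "2");
        props.setProperty("wrap", "2000");
        props.setProperty("clean", "yes");
        props.setProperty("char-encoding", "latin1");
        props.setProperty("output-html", "yes");
        return props;
    }

    public static void main(String[] args) {
        Properties props = fromAntParameters();

        // Assumption: with JTidy available, the options would be applied like this:
        //   Tidy tidy = new Tidy();
        //   tidy.setConfigurationFromProps(props);

        // Print the collected options (iteration order is not guaranteed).
        props.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Keeping the options in one `Properties` object also makes it easy to compare them against what `printConfigOptions` reports for the live configuration.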
