
jTidy returns nothing after tidying HTML

I have come across a very annoying problem when using jTidy (on Android). jTidy works on every HTML document I have tested it against, except the following:

    <!DOCTYPE html>
      <html lang="en">
       <head>
        <meta charset="utf-8" />

         <!-- Always force latest IE rendering engine & Chrome Frame 
              Remove this if you use the .htaccess -->
         <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

         <title>templates</title>
         <meta name="description" content="" />
         <meta name="author" content="" />

         <meta name="viewport" content="width=device-width; initial-scale=1.0" />

         <!-- Replace favicon.ico & apple-touch-icon.png in the root of your domain and delete these references -->
      <link rel="shortcut icon" href="/favicon.ico" />
      <link rel="apple-touch-icon" href="/apple-touch-icon.png" />
   </head>

 <body>
   <div>
     <header>
       <h1>Page Heading</h1>
     </header>
     <nav>
       <p><a href="/">Home</a></p>
       <p><a href="/contact">Contact</a></p>
     </nav>

     <div>

     </div>

     <footer>
      <p>&copy; Copyright</p>
     </footer>
   </div>
 </body>
 </html>

But after tidying it, jTidy returns nothing (that is, if the String containing the tidied HTML is called result, then result.equals("") == true).

I have noticed something very interesting, though: if I remove everything in the body part of the HTML, jTidy works perfectly. Is there something in the <body></body> that jTidy doesn't like?

Here is the Java code I am using:

 public String tidy(String sourceHTML) {
   StringReader reader = new StringReader(sourceHTML);

   ByteArrayOutputStream baos = new ByteArrayOutputStream();
   Tidy tidy = new Tidy();
   tidy.setMakeClean(true);
   tidy.setQuiet(false);
   tidy.setIndentContent(true);
   tidy.setSmartIndent(true);

   tidy.parse(reader, baos);

   try {
     return baos.toString(mEncoding);
   } catch (UnsupportedEncodingException e) {
     return null;
   }
 }

Is there something wrong with my Java? Is this an error in jTidy? Is there any way I can make jTidy not do this? (I cannot change the HTML.) If this absolutely cannot be fixed, are there any other good HTML tidiers? Thanks very much!

Try this:

tidy.setForceOutput(true);

There are probably parse errors in the document; by default jTidy writes no output when it has recorded errors.
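To check whether parse errors really are the cause, it can help to capture jTidy's diagnostics rather than discard them. The following is a minimal sketch, assuming a JTidy release (such as r938) that provides setForceOutput, setErrout, getParseErrors, getParseWarnings, and the parse(Reader, Writer) overload. The class name TidyDiagnostics and its Result holder are hypothetical names introduced for illustration. One plausible cause, not confirmed in this thread, is that jTidy predates HTML5 and reports elements such as header, nav, and footer as unrecognized.

```java
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

import org.w3c.tidy.Tidy;

// Hypothetical helper (not part of JTidy): tidies HTML and keeps the diagnostics.
public class TidyDiagnostics {

    /** Tidied markup together with the messages and counts JTidy reported. */
    public static final class Result {
        public final String html;      // tidied output (may be empty)
        public final String messages;  // everything JTidy wrote to its error stream
        public final int errors;       // number of parse errors
        public final int warnings;     // number of parse warnings

        private Result(String html, String messages, int errors, int warnings) {
            this.html = html;
            this.messages = messages;
            this.errors = errors;
            this.warnings = warnings;
        }
    }

    public static Result tidy(String sourceHtml) {
        Tidy tidy = new Tidy();
        // Ask JTidy to emit output even if it recorded parse errors.
        tidy.setForceOutput(true);
        tidy.setIndentContent(true);

        // Collect diagnostics in memory instead of printing them to the console.
        StringWriter messages = new StringWriter();
        tidy.setErrout(new PrintWriter(messages, true));

        // Reader in, Writer out: no byte-to-String decoding step is needed.
        StringWriter out = new StringWriter();
        tidy.parse(new StringReader(sourceHtml), out);

        return new Result(out.toString(), messages.toString(),
                tidy.getParseErrors(), tidy.getParseWarnings());
    }
}
```

If errors is greater than zero, messages should name the offending elements, which tells you whether forcing output is enough or whether the markup needs a parser that understands HTML5.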

Check out jsoup; it's my recommendation for any kind of Java HTML processing (I've used HtmlCleaner too, but then switched to jsoup).

Cleaning HTML with jsoup:

final String yourHtml = ...

String output = Jsoup.clean(yourHtml, Whitelist.relaxed());

That's all!

Or, if you want to change / remove / parse / ... something:

Document doc = Jsoup.parse(<file/string/website>, null);

String output = doc.toString();
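As a concrete version of the parse-and-print approach for an HTML string, here is a minimal sketch. It assumes jsoup is on the classpath; the class name JsoupTidy is a hypothetical name used only for illustration. It uses Jsoup.parse(String), Document.outputSettings(), and outerHtml().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper (not part of jsoup): normalizes and re-indents an HTML string.
public class JsoupTidy {

    public static String tidy(String sourceHtml) {
        // jsoup parses HTML5 and repairs missing or unbalanced tags while building the tree.
        Document doc = Jsoup.parse(sourceHtml);

        // Pretty-print the serialized document with a two-space indent.
        doc.outputSettings().prettyPrint(true).indentAmount(2);

        // Serialize the whole document, including <html>, <head>, and <body>.
        return doc.outerHtml();
    }
}
```

Unlike Jsoup.clean, which sanitizes a body fragment against a list of allowed tags, this keeps the full document, including the head. Note also that more recent jsoup releases replaced the Whitelist class with Safelist, so the clean call above may need Safelist.relaxed() depending on the version in use.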
