
JTidy reports “3 errors were found!”… but does not say what they are

I have a large block of programmatically generated HTML. I ran it through Tidy (version r938) with the following Java code:

StringReader inStr = new StringReader(htmlInput);
StringWriter outStr = new StringWriter();
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.parseDOM(inStr, outStr);

I get the following output:

InputStream: Document content looks like HTML 4.01 Transitional
247 warnings, 3 errors were found!
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Trouble is, Tidy doesn't tell me what the 3 errors are.

I'm fibbing here a little. The output above actually follows a long list of all 247 warnings (mostly trimming out empty div elements). I can suppress those with tidy.setShowWarnings(false); either way, I see no error report, so I can't figure out what I need to fix. 300 KB of HTML is too much for me to eyeball.

I've tried numerous approaches to finding the error. I can't run it through validate.w3.org, sadly, as the HTML file is on a proprietary network. The most informative approach was to open it in IntelliJ IDEA; this revealed a dozen or so duplicate div IDs, which I fixed. Errors still occurred.

I've looked around for other mentions of this problem. While I find plenty of hits on things like "How can I get the error/warning messages out of the parsed HTML using JTidy?", they all appear to be asking for dissimilar things, or assume conditions that simply aren't holding for me. I'm getting warnings just fine, for example; it's the errors I need, and they're not being reported, even if I call setShowErrors(100) or something.

Am I going to have to dive into Tidy's source code and debug it, starting where it reports errors? Or is there something much simpler I could do?

Here's what I ended up doing to track down the errors:

  1. Download JTidy's source. Most people should be able to go straight to the source.
  2. Unzip the source into my dev area, right on top of my existing source code. This also meant removing the Maven entry for JTidy from my pom.xml. (It also meant beating IntelliJ into submission, i.e. editing the relevant .iml files and restarting IJ a lot, when it got extremely confused by this.)
  3. Set a breakpoint in Report.error. The first line of org.w3c.tidy.Report.error() increments lexer.errors; error() is called from many places in the lexer.
  4. Run my program in debug mode. Expect this to take a little while if the input HTML is large; a 300 KB file took around 10-15 seconds on my machine to stop on an error that turned out to be at the very end of the file.
  5. Look at the contents of lexbuf. lexbuf is a byte array, so your IDE might not show it as text. It might also be large. You probably want to check which index within lexbuf the lexer had reached. If you have to, take that section of the byte array and cross-reference it with an ASCII table to get the text.
  6. Search for that text in your HTML. Assuming it appears only once, there's your error. (In my case, it appeared exactly three times, and sure enough, I had three errors reported.)
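
Steps 5 and 6 can be done without an ASCII table. The sketch below is only an illustration: the class and method names (LexbufPeek, excerpt, locate) are hypothetical helpers, not part of JTidy, and it assumes you have copied the byte array and the lexer's current index out of the debugger (or call the helper from the debugger's expression evaluator). It decodes a window of bytes around the index and then lists the line and column of every place that text occurs in the original HTML.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Debugging helpers; every name here is hypothetical and not part of JTidy. */
final class LexbufPeek {

    /** Decodes up to radius bytes on each side of index into readable text. */
    static String excerpt(byte[] buf, int index, int radius) {
        int center = Math.max(0, Math.min(index, buf.length));
        int start = Math.max(0, center - radius);
        int end = Math.min(buf.length, center + radius);
        // ISO-8859-1 maps each byte to exactly one char, so the window is
        // never rejected or merged even if the bytes are not valid UTF-8.
        return new String(buf, start, end - start, StandardCharsets.ISO_8859_1);
    }

    /** Returns "line:column" (both 1-based) for every occurrence of needle in html. */
    static List<String> locate(String html, String needle) {
        List<String> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits;
        }
        int line = 1;
        int lineStart = 0;
        int scanned = 0;
        for (int at = html.indexOf(needle); at >= 0; at = html.indexOf(needle, at + 1)) {
            // Count the newlines between the previous hit and this one.
            for (; scanned < at; scanned++) {
                if (html.charAt(scanned) == '\n') {
                    line++;
                    lineStart = scanned + 1;
                }
            }
            hits.add(line + ":" + (at - lineStart + 1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice buf and index come from the debugger.
        String html = "<div>\n<script>var s = \"</b>\";</script>\n</div>";
        byte[] buf = html.getBytes(StandardCharsets.ISO_8859_1);
        int index = html.indexOf("</b>");
        System.out.println("around index: [" + excerpt(buf, index, 4) + "]");
        System.out.println("occurrences:  " + locate(html, "</b>"));
    }
}
```

If the search returns more than one position, each one is a candidate for a separate reported error, which matches the three occurrences and three errors in my case.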

This was much more involved than it probably should have been. I suspect Report.error() was being called inappropriately.

In my case, error() was called with the constant BAD_CDATA_CONTENT. This constant is handled only by Report.warning(). error() doesn't know what to do with it, and just exits silently with no message at all. If I change the call in Lexer.getCDATA() from error() to warning(), I get the exact line and column of my error. (I also get what appears to be reasonably well-formed XHTML, instead of an empty document.)
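
To make the failure mode concrete, here is a toy model of it. This is not JTidy's actual code: the class ToyReport, its two method variants, the numeric codes, and the message texts are all made up for illustration. It only shows how a switch without a default branch can increment an error counter while printing nothing, and how a default branch surfaces the unhandled code together with its position.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a message reporter. NOT JTidy's code; codes and texts are invented. */
final class ToyReport {
    static final int MISSING_ENDTAG = 1;     // hypothetical code that error() knows about
    static final int BAD_CDATA_CONTENT = 2;  // hypothetical code that only warning() knows about

    int errors = 0;
    final List<String> messages = new ArrayList<>();

    /** The failure mode: the counter goes up, but an unknown code produces no message. */
    void errorSilent(int code, int line, int col) {
        errors++;
        switch (code) {
            case MISSING_ENDTAG:
                messages.add(position(line, col) + "Error: missing end tag");
                break;
            // No default branch: any other code falls out of the switch silently.
        }
    }

    /** Same logic with a default branch, so an unhandled code is still reported. */
    void errorWithDefault(int code, int line, int col) {
        errors++;
        switch (code) {
            case MISSING_ENDTAG:
                messages.add(position(line, col) + "Error: missing end tag");
                break;
            default:
                messages.add(position(line, col) + "Error: unhandled error code " + code);
                break;
        }
    }

    private static String position(int line, int col) {
        return "line " + line + " column " + col + " - ";
    }

    public static void main(String[] args) {
        ToyReport report = new ToyReport();
        report.errorSilent(BAD_CDATA_CONTENT, 10, 5);
        System.out.println(report.errors + " error(s), messages: " + report.messages);
        report.errorWithDefault(BAD_CDATA_CONTENT, 10, 5);
        System.out.println(report.errors + " error(s), messages: " + report.messages);
    }
}
```

The first call reproduces the symptom from the question, a non-zero error count with nothing to read; the second at least gives a line and column to start from.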

I'd submit a ticket to the JTidy project with some suggestions, but SourceForge isn't letting me log in for some reason. So, here:

  • Given that this "error" appears not to doom the document to unparseability, I'll tentatively suggest that the call be made a warning instead. (In my specific case, it was an HTML tag inside a string constant or comment inside a script element; shouldn't have hurt anything. I asked another question about it, just in case.)
  • Report.error() should have a default case that reports an unhandled error code if it gets one.
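
If patching JTidy is not an option, the generated HTML can be changed instead. The HTML 4.01 specification recommends escaping "</" inside script content (for JavaScript, writing it as "<\/"), because "</" followed by a letter formally ends the element's CDATA. The sketch below is a workaround under stated assumptions: ScriptEscaper and escapeForScriptElement are hypothetical names, the helper must be applied to the JavaScript text before it is wrapped in script tags, and I am assuming rather than claiming that this makes JTidy stop reporting the error.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for HTML generators; not part of JTidy. */
final class ScriptEscaper {

    // "</" only matters when a letter follows it, so match exactly that.
    private static final Pattern END_TAG_OPEN = Pattern.compile("</(?=[A-Za-z])");

    /**
     * Rewrites "</x" as "<\/x" in JavaScript source that will be embedded
     * in a script element. Apply this to the script body only, never to the
     * surrounding markup, or the real closing tag gets escaped too.
     */
    static String escapeForScriptElement(String js) {
        return END_TAG_OPEN.matcher(js).replaceAll(Matcher.quoteReplacement("<\\/"));
    }

    public static void main(String[] args) {
        String js = "var label = \"<b>bold</b>\"; // builds </b> later";
        System.out.println(escapeForScriptElement(js));
    }
}
```

The rewrite leaves the meaning unchanged inside string literals, regular expression literals, and comments, which covers the case described in the first bullet. It is not safe for the rare script where "</" followed by a letter appears as live code.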

Hope this helps anyone else who runs into what I'm guessing is a rather esoteric problem.

