Heritrix not finding CSS files in conditional comment blocks

The Problem/evidence

Heritrix does not detect files referenced inside conditional comments that open and close on a single line, such as this:

<!--[if (gt IE 8)|!(IE)]><!--> 
<link rel="stylesheet" href="/css/mod.css" />
<!--<![endif]-->

However, standard conditional blocks like this work fine:

<!--[if lte IE 9]>
<script src="/js/ltei9.js"></script>
<![endif]-->

I've identified the problem as being with this part of the comment:

<!-->

Removing that fragment in a test case allows Heritrix to discover the CSS file.

The Question

How should I overcome this? Is it a Heritrix bug, or is it something we can work around with a crawler-beans declaration? I'm aware that the comment block is there to "trick" certain browser versions, and changing the website code is not an option. Can Heritrix be adapted to ignore such comments?

ExtractorHTML parses the page using the following regex:

static final String RELEVANT_TAG_EXTRACTOR =
    "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
    "|((style[^>]*+)>.*?</style)" + // 3, 4
    "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
    "|(!--(?!\\[if).*?--))>"; // 8

Basically, cases 1 to 7 match the tags that are interesting for link extraction, and case 8 matches HTML comments in order to discard them. As you can see, case 8 carefully avoids matching comments of the form <!--[if ... --> , so that they are not discarded. In your specific case, however, the <!--> that follows is matched as the start of a comment. Because .*? is reluctant, everything up to the next --> is discarded, which here is the one that closes <!--<![endif]--> , so the <link> tag in between is swallowed along with it.
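To see this behaviour outside Heritrix, here is a minimal standalone sketch that runs the pattern quoted above against the snippet from the question. It is not Heritrix code: the class name OriginalPatternDemo and the scan helper are made up for illustration, and the value 1024 for MAX_ELEMENT_REPLACE is an assumption standing in for whatever Heritrix actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginalPatternDemo {
    // Assumption: stand-in value for Heritrix's MAX_ELEMENT_REPLACE constant.
    static final int MAX_ELEMENT_REPLACE = 1024;

    // The pattern as quoted in the answer, before the fix.
    static final String RELEVANT_TAG_EXTRACTOR =
        "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
        "|((style[^>]*+)>.*?</style)" + // 3, 4
        "|(((meta)|(?:\\w{1," + MAX_ELEMENT_REPLACE + "}))\\s+[^>]*+)" + // 5, 6, 7
        "|(!--(?!\\[if).*?--))>"; // 8

    /** Returns every match, labelled "comment: " when group 8 matched, else "tag: ". */
    static List<String> scan(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(RELEVANT_TAG_EXTRACTOR).matcher(html);
        while (m.find()) {
            String label = (m.group(8) != null) ? "comment: " : "tag: ";
            out.add(label + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html =
            "<!--[if (gt IE 8)|!(IE)]><!-->\n" +
            "<link rel=\"stylesheet\" href=\"/css/mod.css\" />\n" +
            "<!--<![endif]-->";
        // Print each match so the swallowed <link> is visible inside the comment match.
        scan(html).forEach(System.out::println);
    }
}
```

For the snippet from the question, this should report a single match through group 8, starting at <!--> and containing the <link> line, which is why the stylesheet never reaches link extraction.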

<!--[if (gt IE 8)|!(IE)]><!--> is a trick to produce valid XHTML in which the conditional content is still parsed by any non-IE browser. Heritrix could be fixed here by making RELEVANT_TAG_EXTRACTOR not treat <!--> as the start of a comment. This should work:

static final String RELEVANT_TAG_EXTRACTOR =
    "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
    "|((style[^>]*+)>.*?</style)" + // 3, 4
    "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
    "|(!--(?!\\[if|>).*?--))>"; // 8
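The modified pattern can be checked in isolation before touching Heritrix. The sketch below is a hypothetical test harness, not part of Heritrix: the class FixedPatternDemo and its tags helper are invented for illustration, and 1024 is again an assumed value for MAX_ELEMENT_REPLACE. It collects only the non-comment matches, for both the problematic snippet and the standard conditional block, to check that the first now exposes the <link> and the second still exposes the <script>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FixedPatternDemo {
    // Assumption: stand-in value for Heritrix's MAX_ELEMENT_REPLACE constant.
    static final int MAX_ELEMENT_REPLACE = 1024;

    // The pattern with the proposed fix: case 8 also refuses to start at "<!-->".
    static final Pattern FIXED = Pattern.compile(
        "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
        "|((style[^>]*+)>.*?</style)" + // 3, 4
        "|(((meta)|(?:\\w{1," + MAX_ELEMENT_REPLACE + "}))\\s+[^>]*+)" + // 5, 6, 7
        "|(!--(?!\\[if|>).*?--))>"); // 8

    /** Returns the matches that are not comments, i.e. where group 8 did not participate. */
    static List<String> tags(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = FIXED.matcher(html);
        while (m.find()) {
            if (m.group(8) == null) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String downlevelRevealed =
            "<!--[if (gt IE 8)|!(IE)]><!-->\n" +
            "<link rel=\"stylesheet\" href=\"/css/mod.css\" />\n" +
            "<!--<![endif]-->";
        String standardBlock =
            "<!--[if lte IE 9]>\n" +
            "<script src=\"/js/ltei9.js\"></script>\n" +
            "<![endif]-->";
        System.out.println(tags(downlevelRevealed));
        System.out.println(tags(standardBlock));
    }
}
```

With the extra |> alternative in the negative lookahead, the <!--> fragment no longer opens a comment, so the <link> tag is matched by cases 5 to 7, while the trailing <!--<![endif]--> is still consumed as an ordinary comment.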

You can always compile a Java class that extends org.archive.modules.extractor.ExtractorHTML with the fix, and use your class in place of the original ExtractorHTML.
