Heritrix does not find CSS files in conditional comment blocks

Question

The Problem/evidence

Heritrix does not detect files referenced inside conditional comments that open and close in a single string, such as this:

<!--[if (gt IE 8)|!(IE)]><!--> 
<link rel="stylesheet" href="/css/mod.css" />
<!--<![endif]-->

However, standard conditional blocks like this work fine:

<!--[if lte IE 9]>
<script src="/js/ltei9.js"></script>
<![endif]-->

I've identified the problem as being this part of the comment:

<!-->

Removing that block in a test case allows Heritrix to discover the CSS file.

The Question

How should I overcome this? Is it a Heritrix bug, or is it something we can get around with a crawler-beans declaration? I'm aware that the comment block is there to "trick" certain browser versions, and changing the website code is not an option. Can Heritrix be adapted to negate comments?

Answer 1

ExtractorHTML parses the page using the following regex:

 static final String RELEVANT_TAG_EXTRACTOR =
     "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
     "|((style[^>]*+)>.*?</style)" + // 3, 4
     "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
     "|(!--(?!\\[if).*?--))>"; // 8

Basically, cases 1 to 7 match the tags that are of interest for link extraction, and case 8 matches HTML comments in order to discard them. As you can see, case 8 carefully avoids matching comments of the form <!--[if ...]>, so that they are not discarded. So in your specific case, the <!--> that follows the conditional opening is matched as the start of an ordinary comment, and everything up to the next --> is discarded, including the link tag.

<!--> is a trick to make valid XHTML where the conditional content is parsed by any non-IE browser. Heritrix could be fixed here by making RELEVANT_TAG_EXTRACTOR not consider <!--> as a comment start. This should work:

 static final String RELEVANT_TAG_EXTRACTOR =
     "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
     "|((style[^>]*+)>.*?</style)" + // 3, 4
     "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
     "|(!--(?!\\[if|>).*?--))>"; // 8
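To check the effect of the extra |> in the lookahead without running a crawl, both patterns can be applied to the markup from the question with plain java.util.regex. The following is only a sketch: the class name ConditionalCommentDemo, the helper relevantTags, and the value 64 used for MAX_ELEMENT_REPLACE are assumptions made for illustration and are not taken from the Heritrix source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConditionalCommentDemo {
    // Assumed stand-in for the element-length limit; not the Heritrix constant.
    static final int MAX_ELEMENT_REPLACE = 64;

    // Pattern as quoted in the answer (case 8 only skips "<!--[if").
    static final String ORIGINAL =
        "(?is)<(?:((script[^>]*+)>.*?</script)" +                          // 1, 2
        "|((style[^>]*+)>.*?</style)" +                                    // 3, 4
        "|(((meta)|(?:\\w{1," + MAX_ELEMENT_REPLACE + "}))\\s+[^>]*+)" +   // 5, 6, 7
        "|(!--(?!\\[if).*?--))>";                                          // 8

    // Proposed fix: case 8 also refuses to treat "<!-->" as a comment start.
    static final String PATCHED =
        "(?is)<(?:((script[^>]*+)>.*?</script)" +                          // 1, 2
        "|((style[^>]*+)>.*?</style)" +                                    // 3, 4
        "|(((meta)|(?:\\w{1," + MAX_ELEMENT_REPLACE + "}))\\s+[^>]*+)" +   // 5, 6, 7
        "|(!--(?!\\[if|>).*?--))>";                                        // 8

    /** Returns every match that is not a discarded comment (group 8 unset). */
    static List<String> relevantTags(String regex, String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            if (m.group(8) == null) {
                tags.add(m.group());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // The markup from the question.
        String html = "<!--[if (gt IE 8)|!(IE)]><!--> \n"
                + "<link rel=\"stylesheet\" href=\"/css/mod.css\" />\n"
                + "<!--<![endif]-->";
        System.out.println("original: " + relevantTags(ORIGINAL, html));
        System.out.println("patched:  " + relevantTags(PATCHED, html));
    }
}
```

With the original pattern the list is empty, because the <!--> token opens a comment that swallows the link tag. With the patched pattern the list contains the single link element, and the trailing <!--<![endif]--> is still discarded as a comment.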

You can always compile a Java class that extends org.archive.modules.extractor.ExtractorHTML with the fix, and use your class in place of the original ExtractorHTML.

Solution 1
2 Accepted 2015-06-18 18:54:08
