Finding comments in HTML

I have an HTML file that may contain JavaScript, PHP, and whatever else people may or may not put into their HTML files.

I want to extract all comments from this HTML file.

I can point out two problems in doing this:

  1. What is a comment in one language may not be a comment in another.

  2. In JavaScript, the remainder of a line is commented out with the // marker. But URLs also contain //, so if I simply replace // and the rest of the line with nothing, I may well delete parts of URLs.
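To illustrate problem 2, here is a minimal Python sketch of the naive substitution described above. The names `NAIVE_LINE_COMMENT`, `strip_naively`, and the sample line are made up for this illustration; the point is only that the rule cannot tell a URL from a comment.

```python
import re

# Naive rule (hypothetical): delete everything from the first "//" to end of line.
NAIVE_LINE_COMMENT = re.compile(r"//.*")

def strip_naively(source: str) -> str:
    """Apply the naive rule to every line of the given source text."""
    return "\n".join(NAIVE_LINE_COMMENT.sub("", line) for line in source.splitlines())

sample = 'var home = "http://example.com/index.html"; // go home'
print(strip_naively(sample))  # prints: var home = "http:
```

The URL is cut off at its own // and the real comment is never considered on its own.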

So this is not a trivial problem.

Is there an existing solution for this anywhere?

Has anybody already done this?

Problem 2: Isn't every URL quoted, as either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes to tell whether it's a real URL or just a comment.
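A rough Python sketch of that quote check, assuming the JavaScript source has already been isolated as a string. The function name `find_line_comments` is hypothetical, and the scanner is deliberately incomplete: it knows nothing about regex literals, template literals, or /* */ block comments, so it can still be fooled.

```python
def find_line_comments(js: str) -> list[str]:
    """Return each // comment that starts outside a quoted string."""
    comments = []
    quote = None  # quote character of the string we are inside, or None
    i = 0
    while i < len(js):
        ch = js[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # the string literal ends here
        elif ch in ('"', "'"):
            quote = ch  # a string literal starts here
        elif js.startswith("//", i):
            end = js.find("\n", i)
            end = len(js) if end == -1 else end
            comments.append(js[i:end])
            i = end  # resume scanning at the newline
            continue
        i += 1
    return comments
```

For the input `var u = "http://example.com"; // real comment` only `// real comment` is reported, because the // of the URL is seen while `quote` is set.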

Look into parser generators like ANTLR, which has grammars for many languages, and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.

Consider

Problem 3: text that uses a language's comment syntax is not always a comment in that language.

<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>

Problem 4: a comment embedded in another language may not be obviously a comment.

<button onclick="&#47;&#47; this is a comment//&#10;notAComment()">
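To see why the attribute above contains a comment, it helps to decode the character references the way a browser does before running the handler. A small Python sketch using the standard library's `html.unescape` (the variable names are made up for this illustration):

```python
import html

# The onclick attribute value from the example, exactly as written in the HTML.
attr_value = "&#47;&#47; this is a comment//&#10;notAComment()"

# &#47; decodes to "/" and &#10; decodes to a newline.
decoded = html.unescape(attr_value)
for line in decoded.splitlines():
    print(repr(line))
```

After decoding, the first line starts with // and is a JavaScript line comment, while the newline produced by &#10; ends it, so notAComment() is live code on the second line.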

Problem 5: what counts as a comment may depend on how the browser is configured.

<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->

I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.

https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

From your wording, it seems you are considering an approach based on regular expressions. Doing that over the whole file is painful. Instead, try using tools to mark or discard the interesting or uninteresting text, then work on whatever is left in your sieve according to your keep/discard criteria. Have a look at HTML::Tree and TreeBuilder; they could be very useful for dealing with the HTML markup.
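HTML::Tree and TreeBuilder are Perl modules. As a stand-in, here is a sketch of the same idea in Python with the standard library's `html.parser`: let a parser decide what is markup and only collect what it reports as a comment. The class and function names are hypothetical, and this parser is not a full browser-grade HTML parser, so edge cases such as the ones listed earlier may still be classified differently from a real browser.

```python
from html.parser import HTMLParser

class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called with the text between "<!--" and "-->".
        self.comments.append(data)

def html_comments(markup: str) -> list[str]:
    collector = CommentCollector()
    collector.feed(markup)
    collector.close()
    return collector.comments

page = ('<p>text</p><!-- a real comment -->'
        '<script>var s = "<!-- not a comment -->";</script>')
print(html_comments(page))
```

The parser treats the contents of a script element as raw text, so the comment-like string inside the script is not reported as an HTML comment.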

I would convert the HTML file into a character array and parse it. As you move forward, you can detect key strings like "<", "--", "www", and "http", and either skip or delete those segments.

The start/end indices will have to be identified properly, which is a challenge, but you will have full control.
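A minimal sketch of that index-based scan in Python, restricted to HTML comments delimited by `<!--` and `-->`. The names `comment_spans` and `remove_spans` are made up for this illustration. The scan is context-blind, so it will also match comment-like text inside a textarea or a script, which is exactly the kind of case a real solution has to handle separately.

```python
def comment_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of <!-- ... --> segments, end exclusive."""
    spans = []
    i = 0
    while True:
        start = text.find("<!--", i)
        if start == -1:
            break
        end = text.find("-->", start + 4)
        if end == -1:
            break  # unterminated comment: stop rather than guess
        spans.append((start, end + 3))
        i = end + 3
    return spans

def remove_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Rebuild the text without the given spans (assumed sorted, non-overlapping)."""
    kept = []
    pos = 0
    for start, end in spans:
        kept.append(text[pos:start])
        pos = end
    kept.append(text[pos:])
    return "".join(kept)

doc = "a<!-- x -->b<!-- y -->c"
spans = comment_spans(doc)
print(spans, remove_spans(doc, spans))
```

Keeping the spans separate from the removal step lets the same indices be used either to extract the comments or to delete them.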

There are also other ways to simplify the process if performance is not a concern. For example, all tags can be grabbed with XML::Twig, and the resulting strings can be parsed to detect JS comments.
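XML::Twig is a Perl module. A comparable first step can be sketched in Python with the standard library's `html.parser`, here only pulling out the text of each script element so that a separate JavaScript-aware pass can look for comments in it. The class name `ScriptCollector` is hypothetical, and the sketch assumes script elements are not nested and are properly closed.

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collects the raw text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")  # start a new script body

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so append to the current body.
        if self._in_script:
            self.scripts[-1] += data

page = ('<p>hi</p><script>var a = 1; // one</script>'
        '<script>var b = 2;</script>')
collector = ScriptCollector()
collector.feed(page)
collector.close()
print(collector.scripts)
```

Each collected string can then be handed to whatever JavaScript comment detection is chosen, without HTML markup getting in the way.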
