Finding comments in HTML

Question

I have an HTML file that may contain JavaScript, PHP, and anything else people may or may not put into their HTML files.

I want to extract all comments from this HTML file.

I can point out two problems in doing this:

1. What is a comment in one language may not be a comment in another.
2. In JavaScript, the // marker comments out the remainder of a line. But URLs also contain //, so if I simply replace // and the rest of the line with nothing, I may well cut off parts of URLs.

So this is not a trivial problem.

Is there an existing solution for this available anywhere?

Has anybody already done this?

Answer 1

Problem 2: Isn't every URL quoted, as either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether there is an opening quote mark before the slashes to know whether they belong to a URL or start a comment.
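The quote check described above could be sketched roughly as follows. This is only a minimal illustration in Python, and the function name find_line_comments is made up for this example. It assumes URLs always sit inside single- or double-quoted strings, and it deliberately ignores regex literals, template literals, and /* */ block comments, all of which would need extra states.

```python
def find_line_comments(js_source):
    """Return each // comment that is not inside a quoted string.

    Hypothetical helper for illustration; it does not understand regex
    literals, template literals, or /* */ block comments.
    """
    comments = []
    quote = None  # quote character of the string we are inside, if any
    i = 0
    n = len(js_source)
    while i < n:
        ch = js_source[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the escaped character
                continue
            if ch == quote:
                quote = None  # the string literal ends here
        elif ch in ("'", '"'):
            quote = ch  # a string literal starts here
        elif js_source.startswith("//", i):
            end = js_source.find("\n", i)
            if end == -1:
                end = n
            comments.append(js_source[i:end])
            i = end
            continue
        i += 1
    return comments
```

With this sketch, the // inside "http://example.com" is skipped because the scanner is inside a string at that point, while a trailing // note on the same line is reported.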

Answer 2

Look into parser generators like ANTLR, which has grammars for many languages, and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even with a parser, it won't be 100% accurate.

Consider

Problem 3: what looks like a comment in a language is not always a comment in that language.

<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
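To make the textarea case in Problem 3 concrete, here is a rough sketch using Python's standard html.parser module. The class name CommentCollector and the function html_comments are invented for this example. Depending on the Python version, the module may or may not treat textarea content as plain text on its own, so the sketch tracks those elements itself. It only finds HTML comments; comments inside script, style, or PHP sections are out of scope here.

```python
from html.parser import HTMLParser

# Elements whose content is text only, so "<!--" inside them is not a comment.
TEXT_ONLY_ELEMENTS = {"textarea", "title"}


class CommentCollector(HTMLParser):
    """Collect HTML comments, skipping comment-like text in text-only elements."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._text_only_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_ONLY_ELEMENTS:
            self._text_only_depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_ONLY_ELEMENTS and self._text_only_depth > 0:
            self._text_only_depth -= 1

    def handle_comment(self, data):
        # Ignore anything reported as a comment while inside a text-only element.
        if self._text_only_depth == 0:
            self.comments.append(data)


def html_comments(markup):
    collector = CommentCollector()
    collector.feed(markup)
    collector.close()
    return collector.comments
```

Feeding it a real comment followed by the textarea line above should return only the real comment, since the text inside the textarea is skipped.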

Problem 4: a comment embedded in a language may not obviously be a comment.

<button onclick="&#47;&#47; this is a comment//&#10;notAComment()">
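Why that onclick value contains a comment becomes clearer once the character references are decoded, which is what a browser does before handing the attribute value to the JavaScript engine. A small sketch with Python's html.unescape; the helper name decode_handler is hypothetical:

```python
import html


def decode_handler(raw_attribute_value):
    """Decode character references the way an HTML parser would for an attribute."""
    return html.unescape(raw_attribute_value)


raw_value = "&#47;&#47; this is a comment//&#10;notAComment()"
decoded = decode_handler(raw_value)

# &#47; decodes to "/" and &#10; decodes to a newline, so the first line of
# the handler is a // comment and the second line is live code.
first_line, _, rest = decoded.partition("\n")
```

So any tool that looks for // in the raw attribute text only finds the trailing one, and misses that the whole first line is a comment.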

Problem 5: what is a comment may depend on how the browser is configured.

<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->

I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.

https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

Answer 3

It sounds from your wording as though you are considering an approach based on regular expressions. Applying them to the whole file is painful. Instead, try using a tool to separate the interesting text from the uninteresting text, and then work on what is left in your sieve according to your keep/discard criteria. Have a look at HTML::Tree and HTML::TreeBuilder; they could be very useful for dealing with the HTML markup.

Answer 4

I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--", "www", and "http" as you move forward and either skip or delete those segments.

The start and end indices will have to be identified properly, which is a challenge, but you will have full control.
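As a rough idea of what tracking start and end indices could look like, here is a minimal Python sketch that records the span of each <!-- ... --> sequence. The function name html_comment_spans is made up for this example. It is context-blind on purpose: it will also report comment-like text inside textarea or script elements, and it does not handle JavaScript or PHP comments at all.

```python
def html_comment_spans(text):
    """Return (start, end) index pairs for each <!-- ... --> sequence.

    end is exclusive, so text[start:end] is the whole comment including
    its delimiters. An unterminated "<!--" stops the scan.
    """
    spans = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            break
        close = text.find("-->", start + 4)
        if close == -1:
            break
        end = close + 3  # step past the closing "-->"
        spans.append((start, end))
        pos = end
    return spans
```

With the spans in hand, you can either slice the comments out for extraction or rebuild the text from the gaps between them to delete the comments.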

There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.
