Remove all REAL Javascript comments in PHP

I'm looking for a way to strip all JavaScript comments from HTML code using PHP.

I want to strip only JavaScript comments (not HTML comments and so on). I don't think a regex is a solution, because it cannot tell whether something is a real comment or part of a string. Example:

<script>

// This is a comment
/* This is another comment */

// The following is not a comment
var src="//google.com"; 

</script>

Is there a way to do it? Many thanks in advance.

First, you need to extract the content of the script tags. For that, use DOMDocument:

$dom = new DOMDocument;
$dom->loadHTML($html);

$scriptNodes = $dom->getElementsByTagName('script');
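As a minimal sketch of this extraction step, the snippet below loads a small HTML string and collects the text of each script node. The sample `$html` value and the `$scripts` array are assumptions made for illustration, and the `libxml_use_internal_errors()` call is only there to keep parser warnings about sloppy markup quiet.

```php
<?php
// Hypothetical sample input; any HTML string can be used instead.
$html = '<html><body><p>Hello</p>'
      . '<script>var a = 1; // one</script>'
      . '<script>var b = 2;</script>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the raw JavaScript source of every <script> element.
$scripts = [];
foreach ($dom->getElementsByTagName('script') as $scriptNode) {
    $scripts[] = $scriptNode->textContent;
}
```

Each entry of `$scripts` is the unmodified JavaScript source, comments included, ready for the comment-stripping step.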

The second step is to remove all the JavaScript comments from each script node.

You can use a third-party JavaScript parser if you want, but you can do it with a regex too. All you need is to prevent the parts between quotes from being taken into account.

To do that, you must first search for the parts between quotes and discard them. The only difficulty with JavaScript is that a quote can appear inside a regex literal, for example:
/pattern " with a quote/

So you also need to find regex literals to prevent any error.

Pattern example:

$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<squoted> ' [^'\n\\]*+ (?: \\. [^'\n\\]* )*+ ' )
    (?<dquoted> " [^"\n\\]*+ (?: \\. [^"\n\\]* )*+ " )
    (?<tquoted> ` [^`\\]*+ (?s: \\. [^`\\]*)*+ ` )
    (?<quoted>  \g<squoted> | \g<dquoted> | \g<tquoted> )
    
    (?<scomment> // \N* )
    (?<mcomment> /\* [^*]*+ (?: \*+ (?!/) [^*]* )*+ \*/ )
    (?<comment> \g<scomment> | \g<mcomment> )
    
    (?<pattern> / [^\n/*] [^\n/\\]*+ (?>\\.[^\n/\\]*)* / [gimuy]* ) 
)

(?=[[(:,=/"'`])
(?|
    \g<quoted> (*SKIP)(*FAIL)
  |
    ( [[(:,=] \s* ) (*SKIP) (?: \g<comment> \s* )*+ ( \g<pattern> )
  | 
    ( \g<pattern> \s* ) (?: \g<comment> \s* )*+ 
    ( \. \s* ) (?:\g<comment> \s* )*+ ([A-Za-z_]\w*)
  |
    \g<comment>
)
~x
EOD;

Then you replace the content of each script node:

foreach ($scriptNodes as $scriptNode) {
    $scriptNode->nodeValue = preg_replace($pattern, '$9${10}${11}', $scriptNode->nodeValue);
}

$html = $dom->saveHTML();

Pattern details:

(?(DEFINE)...) is an area where you can put all the subpattern definitions you will need later. The "real" pattern begins after it.

(?<name>...) are named subpatterns. They are the same as capture groups, except that you can refer to them by name (like this: \g<name>) instead of by number.

*+ is a possessive quantifier: once it has matched, it never gives characters back by backtracking.

\N means a character that is not a newline.

(?=[[(:,=/"'`]) is a lookahead that checks whether the next character is one of these: [ ( : , = / " ' `. The goal of this test is to avoid trying each branch of the following alternation when the character is something else. If you remove it, the pattern works the same; it is only there to quickly skip useless positions in the string.

(*SKIP) is a backtracking control verb. When the pattern fails after it, the regex engine does not retry the match at any of the positions consumed before the verb; the next attempt starts at the position where (*SKIP) was reached.

(*FAIL) is a backtracking control verb too, and forces the pattern to fail.

(?|..(..)..(..)..|..(..)..(..)..) is a branch-reset group. Inside it, the capture groups restart at the same number in each branch (9, 10 and 11 for this pattern, since the eight named groups of the DEFINE section take numbers 1 to 8).
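To see the (*SKIP)(*FAIL) idiom in isolation, here is a deliberately simplified sketch, not the full pattern above: it only skips single- and double-quoted strings and removes comments, so it does not handle regex literals or template strings. The names `$simplePattern`, `$js` and `$clean` are assumptions made for this illustration.

```php
<?php
// Simplified pattern: quoted strings are matched first and then rejected
// with (*SKIP)(*FAIL), so anything that looks like a comment inside them
// is never considered. Only the last two branches produce real matches.
$simplePattern = <<<'EOD'
~
    (?: ' [^'\n\\]*+ (?: \\. [^'\n\\]* )*+ '
      | " [^"\n\\]*+ (?: \\. [^"\n\\]* )*+ "
    ) (*SKIP)(*FAIL)
  |
    // \N*
  |
    /\* [^*]*+ (?: \*+ (?!/) [^*]* )*+ \*/
~x
EOD;

// Hypothetical input mixing real comments with comment-like string content.
$js = "var src = \"//google.com\"; // real comment\n"
    . "/* block */ var s = 'it\\'s /* not */';";

// Real comments are replaced by nothing; strings are left untouched.
$clean = preg_replace($simplePattern, '', $js);
```

The URL inside the double-quoted string and the `/* not */` inside the single-quoted string survive, while the trailing line comment and the block comment are removed.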

Use this function (note that it is JavaScript, not PHP):

function removeComments(str) {
    // Pad with two characters on each side; slice(2, -2) removes them again
    str = ('__' + str + '__').split('');
    var mode = {
        singleQuote: false,
        doubleQuote: false,
        regex: false,
        blockComment: false,
        lineComment: false,
        condComp: false
    };
    for (var i = 0, l = str.length; i < l; i++) {
        if (mode.regex) {
            if (str[i] === '/' && str[i-1] !== '\\') {
                mode.regex = false;
            }
            continue;
        }
        if (mode.singleQuote) {
            if (str[i] === "'" && str[i-1] !== '\\') {
                mode.singleQuote = false;
            }
            continue;
        }
        if (mode.doubleQuote) {
            if (str[i] === '"' && str[i-1] !== '\\') {
                mode.doubleQuote = false;
            }
            continue;
        }
        if (mode.blockComment) {
            if (str[i] === '*' && str[i+1] === '/') {
                str[i+1] = '';
                mode.blockComment = false;
            }
            str[i] = '';
            continue;
        }
        if (mode.lineComment) {
            // A line comment ends just before the next line break
            if (str[i+1] === '\n' || str[i+1] === '\r') {
                mode.lineComment = false;
            }
            str[i] = '';
            continue;
        }
        if (mode.condComp) {
            // Conditional compilation blocks (/*@ ... @*/) are kept as-is
            if (str[i-2] === '@' && str[i-1] === '*' && str[i] === '/') {
                mode.condComp = false;
            }
            continue;
        }
        mode.doubleQuote = str[i] === '"';
        mode.singleQuote = str[i] === "'";
        if (str[i] === '/') {
            if (str[i+1] === '*' && str[i+2] === '@') {
                mode.condComp = true;
                continue;
            }
            if (str[i+1] === '*') {
                str[i] = '';
                mode.blockComment = true;
                continue;
            }
            if (str[i+1] === '/') {
                str[i] = '';
                mode.lineComment = true;
                continue;
            }
            mode.regex = true;
        }
    }
    return str.join('').slice(2, -2);
}

See also these two links:

http://trinithis.awardspace.com/commentStripper/stripper.html

http://james.padolsey.com/javascript/removing-comments-in-javascript/

For further reference, see "Javascript comment stripper".

This RegExp will work for your example:

^\/(?:\/|\*).*

PHP code:

$re = "/^\\/(?:\\/|\\*).*/m"; 
$str = "<script>\n\n// This is a comment\n/* This is another comment */\n\n// The following is not a comment\nvar src=\"//google.com\"; \n\n</script>"; 

preg_match_all($re, $str, $matches);


Or maybe this one, which also requires the closing */ at the end of the line:

^\/{2}.*|\/\*.*\*\/$

PHP code:

$re = "/^\\/{2}.*|\\/\\*.*\\*\\/$/m"; 
$str = "<script>\n\n// This is a comment\n/* This is another comment */\n\n// The following is not a comment\nvar src=\"//google.com\"; \n\n</script>"; 

preg_match_all($re, $str, $matches);
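Note that preg_match_all only finds the comments. A minimal sketch of actually removing them with preg_replace, reusing the second regex, could look like this. The `$tricky` input is a hypothetical extra case added here to show that a comment which does not start its line is left in place, so this approach only covers inputs shaped like the example.

```php
<?php
// Same regex as above, written in single quotes to avoid double escaping.
$re = '/^\/{2}.*|\/\*.*\*\/$/m';
$str = "<script>\n\n// This is a comment\n/* This is another comment */\n\n"
     . "// The following is not a comment\nvar src=\"//google.com\"; \n\n</script>";

// Replace every match with an empty string; line breaks are kept.
$stripped = preg_replace($re, '', $str);

// Hypothetical extra case: a trailing comment is not at the start of the
// line, so the first branch of the regex cannot match it.
$tricky = 'var a = 1; // trailing comment';
$trickyResult = preg_replace($re, '', $tricky);
```

Here `$stripped` keeps the `var src="//google.com";` line intact, while `$trickyResult` is identical to `$tricky`.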
