Perl regex works in online PCRE testers but not in the perl command

Question

I've written the following PCRE regex to strip scripts from HTML pages: <script.*?>[\s\S]*?< *?\/ *?script *?>

It works on many online PCRE regex testers:

https://regex101.com/r/lsxyI6/1

https://www.regextester.com/?fam=102647

It does NOT work when I run the following perl substitution command in a bash terminal: cat tmp.html | perl -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g'

I am using the following test data:

<script>
                       $(document).ready(function() {
                           var url = window.location.href;
                           var element = $('ul.nav a').filter(function() {
                               if (url.charAt(url.length - 1) == '/') {
                                   url = url.substring(0, url.length - 1);
                               }

                               return this.href == url;
                           }).parent();

                           if (element.is('li')) {
                               element.addClass('active');
                           }
                       });
                   </script>

PS I am using regex to parse HTML because the HTML parser I am forced to use (xmlpath) breaks when there are complex scripts on the page. I am using this regex to remove scripts from the page before passing it to the parser.

Answer 1

By default, perl -p treats each line of the file as its own separate record, so the pattern never sees the opening and closing tags together. You need to tell perl not to split the input that way, with -0.

 perl -0 -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g' tmp.html

Strictly speaking, -0 tells perl to break up records on the null character '\0', which works here only because HTML files normally contain none. perl -0777 explicitly slurps the whole file.

Answer 2

By the way, because I find slurping whole files distasteful, and because I don't care what HTML has to say about line breaks... a quicker, cleaner, more correct way to do this, IF you can guarantee there is no important content on <script> tag lines, is:

perl -ne 'print if !(/<script>/../<\/script>/)' tmp.html

(Modify the two regexes to your fancy, of course.) .. is a stateful operator: it is flipped on when the expression before it becomes true and flipped off when the expression after it becomes true. Both expressions are tested on the same line, so a <script></script> pair on a single line turns it on and off again at once.

~/test£ cat example.html
<important1/>
<edgecase1/><script></script><edgecase2/>
<important2/>
<script></script>
<important3/>
<script>
<notimportant/>
</script>

~/test£ perl -ne 'print if !(/<script>/../<\/script>/)' example.html
<important1/>
<important2/>
<important3/>

And to (mostly) address content on script tag lines but outside the tags:

~/test£ perl -ne 'print if !(/<script>/../<\/script>/);print "$1\n" if /(.+)<script>/;print "$1\n" if /<\/script>(.+)/;' example.html
<important1/>
<edgecase1/>
<edgecase2/>
<important2/>
<important3/>

2 answers

Answer 1: 10, accepted, 2018-02-28 19:41:20

Answer 2: 3, 2018-02-28 23:40:44
