Perl regex working in online PCRE tester but not in perl command

I've written the following PCRE regex to strip scripts from HTML pages: <script.*?>[\s\S]*?< *?\/ *?script *?>

It works on many online PCRE regex testers:

https://regex101.com/r/lsxyI6/1

https://www.regextester.com/?fam=102647

It does NOT work when I run the following perl substitution command in a bash terminal: cat tmp.html | perl -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g'

I am using the following test data:

<script>
                       $(document).ready(function() {
                           var url = window.location.href;
                           var element = $('ul.nav a').filter(function() {
                               if (url.charAt(url.length - 1) == '/') {
                                   url = url.substring(0, url.length - 1);
                               }

                               return this.href == url;
                           }).parent();

                           if (element.is('li')) {
                               element.addClass('active');
                           }
                       });
                   </script>

PS: I am using regex to parse HTML because the HTML parser I am forced to use (xmlpath) breaks when there are complex scripts on the page. I am using this regex to remove scripts from the page before passing it to the parser.

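With -p, perl reads the input one line at a time, so the regex never sees the opening and closing tags of a multi-line script in the same record. You need to tell perl not to break up each line of the file into its own separate record, with -0: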

 perl -0 -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g' tmp.html

Strictly speaking, -0 tells perl to break up records on '\0' (the NUL character), which works here only because the file contains none. perl -0777 will very explicitly slurp the whole file.

By the way, because I find slurping whole files distasteful, and because I don't care what HTML has to say about line breaks, a quicker, cleaner, more correct way to do this, IF you can guarantee there is no important content on <script> tag lines, is:

perl -ne 'print if !(/<script>/../<\/script>/)' tmp.html

(Modify the two regexes to your fancy, of course.) .. is a stateful operator: it is flipped on when the expression before it is true and off when the expression after it is true.

~/test£ cat example.html
<important1/>
<edgecase1/><script></script><edgecase2/>
<important2/>
<script></script>
<important3/>
<script>
<notimportant/>
</script>

~/test£ perl -ne 'print if !(/<script>/../<\/script>/)' example.html
<important1/>
<important2/>
<important3/>

And to (mostly) address content on script tag lines but outside the tags:

~/test£ perl -ne 'print if !(/<script>/../<\/script>/);print "$1\n" if /(.+)<script>/;print "$1\n" if /<\/script>(.+)/;' example.html
<important1/>
<edgecase1/>
<edgecase2/>
<important2/>
<important3/>
