Regular expression to match different script tags in Python

I need to match different script tags, for example:

 <script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
 <script type="text/javascript">
 jQuery(document).ready(function()
 {
    jQuery("#gift_cards").tooltip({ effect: \'slide\'});
 });
 </script>
 <script>dasdfsfsdf</script>

I also need to capture the tags themselves and the src content in separate groups. I created this regex:

(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)

This does not match the last script tag.

What's wrong with it?

EDIT: Removing the \B does match all the script tags, but then I do not get the contents of the src attribute in a separate group. What I need to do is process a set of script tags that fall into two categories:

  1. One with a src attribute holding the path to the actual script
  2. One without a src attribute, containing ordinary inline JavaScript

I need to remove the opening and closing script tags but keep the content inside them. For the first type, I still need to remove the tags but keep the path in a separate table. I hope that clarifies things.

As iCodez' link so entertainingly shows, HTML should not be parsed with regex, as HTML is not a regular language. Instead, try using a parser such as BeautifulSoup. Make sure you also install lxml and html5lib for best performance and access to all the features.

pip install lxml html5lib beautifulsoup4

should do the trick.
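To illustrate the parser-based approach, here is a minimal sketch that uses the standard library's html.parser.HTMLParser, swapped in for BeautifulSoup only so that the example runs without third-party packages. The class name ScriptExtractor and its attributes sources and inline are hypothetical names chosen for this illustration. It collects src paths in one list and inline script bodies in another, which is roughly what the question asks for.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect src paths and inline bodies of <script> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []       # src values of external scripts
        self.inline = []        # bodies of inline scripts
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._buffer).strip()
            if body:
                self.inline.append(body)
            self._in_script = False


html = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
   jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>"""

extractor = ScriptExtractor()
extractor.feed(html)
extractor.close()

print(extractor.sources)  # paths of external scripts
print(extractor.inline)   # bodies of inline scripts
```

With BeautifulSoup installed, the same result can be had more compactly by iterating over soup.find_all("script") and reading each tag's src attribute or text.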

Granted that I agree with all the remarks about not parsing HTML with regular expressions, and granted that I indulge in this evil practice myself when I'm confident that the documents I will process are regular enough: try removing the \B. In my test the pattern then matches all three scripts.

What is this "non-boundary" for, by the way? I'm not sure I understand why you inserted it. If it is necessary for some reason I don't grasp, please tell me and we'll try to find another way.
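For reference, \B matches only at a position that is not a word boundary, that is, between two word characters or between two non-word characters. In a bare <script> tag the position between t and > is a word boundary, so \B cannot match before the tag's own >, and the pattern has to reach for a later > instead. The sketch below checks this in Python. It assumes the three-tag sample from the question with nothing following the last closing tag; the variable names are illustrative.

```python
import re

# Sample from the question; nothing follows the last closing tag.
html = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
   jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>"""

# The pattern from the question, with \B.
original = re.compile(
    r"""(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)"""
)
# The same pattern with \B removed.
relaxed = re.compile(
    r"""(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?[\S\s]*?>)([\s\S]*?)(</script>)"""
)

original_matches = list(original.finditer(html))
relaxed_matches = list(relaxed.finditer(html))

print(len(original_matches))                    # number of tags matched with \B
print([m.group(2) for m in original_matches])   # src group per match
print(len(relaxed_matches))                     # number of tags matched without \B
print([m.group(2) for m in relaxed_matches])    # src group per match
```

On this sample the original pattern should find the first two tags only, while the relaxed one finds all three but leaves the src group empty. Without \B, the first lazy [\s\S]*? can stop immediately, the optional src group matches nothing, and the second lazy run swallows the attribute.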

Edit: To retain the src content, try:

(<\s*?script[\s\S]*?(?:(?:src=[\'"](.*?)[\'"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)

This works for me; check it against your other samples. Note that your first [\s\S]*? already matches everything up to > when there is no src attribute, so the second one only makes sense if src is present and you want to match any other attributes that follow it.
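As a rough sketch of how this pattern could be used in Python for the task in the question (strip the tags, keep inline content, collect src paths separately): the helper name strip_tags and the list sources are hypothetical, and the quote class is written as ['"], which is equivalent to the escaped form above. The usual caveat applies: this is only reasonable for markup you know to be regular.

```python
import re

SCRIPT_RE = re.compile(
    r"""(<\s*?script[\s\S]*?(?:(?:src=['"](.*?)['"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)"""
)

html = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
   jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>"""

sources = []  # src paths found along the way


def strip_tags(match):
    """Record the src path, if any, and return only the script body."""
    src = match.group(2)
    if src is not None:
        sources.append(src)
    return match.group(3)


# Replace every <script>...</script> element with its inner content.
stripped = SCRIPT_RE.sub(strip_tags, html)

print(sources)
print(stripped)
```

Group 1 is the opening tag, group 2 the src value (None when the attribute is absent), group 3 the body, and group 4 the closing tag.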

For giggles, here's a super-simple way that I figured out by complete accident (as a JS string to be fed to the RegExp constructor):

'src=(=|=")' + yourPathHere + '[^<]<\\/script>'

where yourPathHere has had its forward slashes escaped; so, as a pure RE, something like:

/src=(=|=")/scripts/someFolder/script.js[^<]</script>/

which I'm using in a gulp task while I'm trying to figure out gulp streams :[]
