
Regular expression to match different script tags in Python

I need to match different script tags, for example:

 <script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
 <script type="text/javascript">
 jQuery(document).ready(function()
 {
    jQuery("#gift_cards").tooltip({ effect: 'slide'});
 });
 </script>
 <script>dasdfsfsdf</script>

I also need to capture just the tags, plus the src value, in separate groups. I created this regex:

(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)

This does not match the last script tag.

What's wrong with it?

EDIT: Removing the \B does match all the script tags, but then I do not get the contents of the src attribute in a separate group. What I need to do is process a group of script tags that fall into two categories:

  1. One with an src attribute with the path to the actual script
  2. Second without src attribute with normal inline javascript

I need to remove the opening and closing script tags but keep the content inside them. For the first type, I still need to remove the tags but keep the path in a separate table. I hope that clarifies things.

As iCodez' link so entertainingly shows, HTML should not be parsed with regex, because HTML is not a regular language. Instead, try a parser such as BeautifulSoup. Install lxml and html5lib as well for best performance and access to all the features.

pip install lxml html5lib beautifulsoup4

should do the trick.
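To illustrate the parser approach without requiring any installs, here is a minimal sketch that uses Python's standard-library html.parser as a stand-in for BeautifulSoup. The class name ScriptCollector and its attribute names are made up for this example. It collects src paths in one list and inline script bodies in another, which is what the question asks for.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Hypothetical helper: separates external script paths from inline code."""

    def __init__(self):
        super().__init__()
        self.src_paths = []      # values of src attributes
        self.inline_code = []    # bodies of <script> tags that have no src
        self._in_inline = False  # True while inside an inline <script>

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src is not None:
            self.src_paths.append(src)
        else:
            self._in_inline = True
            self.inline_code.append("")

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_inline:
            self.inline_code[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False


SAMPLE = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
   jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>
"""

collector = ScriptCollector()
collector.feed(SAMPLE)
collector.close()

print(collector.src_paths)
print([code.strip() for code in collector.inline_code])
```

With BeautifulSoup installed, roughly the same result should be reachable by calling `find_all("script")` on the parsed document and reading `tag.get("src")` and `tag.string` for each tag.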

Granted, I agree with all the remarks about not parsing HTML with regular expressions, and granted, I indulge in that evil practice myself when I'm confident the documents I will process are regular enough. That said, try removing the \B; in my test the pattern then matches all three scripts.

By the way, what is this "non-boundary" for? I'm not sure I understand why you inserted it. If it is necessary for some reason I do not grasp, please tell me and we'll try to find another way.
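For reference, a small sketch of what \B does in Python's re module may help explain why it gets in the way here. The variable names are only for illustration. \B succeeds only where there is no word boundary, so it cannot match directly between the "t" of "script" and a following ">" or space.

```python
import re

# \B matches only between two word characters or between two non-word
# characters, i.e. anywhere that is NOT a word boundary.

# 't' (word char) followed by '>' (non-word char) is a boundary, so \B fails.
bare_tag = re.search(r"script\B", "<script>")

# 't' followed by a space is also a boundary, so \B fails here too.
tag_with_attr = re.search(r"script\B", "<script type=")

# '"' followed by '>' are both non-word chars, so \B succeeds between them.
after_quote = re.search(r'"\B>', 'src="x.js">')

print(bare_tag)       # None
print(tag_with_attr)  # None
print(after_quote.group(0))
```

In a bare `<script>` tag the first place \B can succeed is therefore inside the script body, which is why the opening-tag group overshoots.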

Edit: In order to retain the src content try

(<\s*?script[\s\S]*?(?:(?:src=['"](.*?)['"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)

This works for me; check it against your other samples. Note that your first [\s\S]*? already matches everything up to > when there is no "src" attribute, so the second one only makes sense if "src" is present and you want to match other attributes that may follow it.
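As a quick check, the sketch below runs the three patterns discussed in this thread (the original, the original without \B, and the revised one) against the sample from the question. The constant names and the summarize helper are made up for this example, and the result only says something about this one sample.

```python
import re

SAMPLE = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
   jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>
"""

# Pattern from the question.
ORIGINAL = re.compile(
    r"""(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)"""
)
# Same pattern with the \B removed.
WITHOUT_B = re.compile(
    r"""(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?[\S\s]*?>)([\s\S]*?)(</script>)"""
)
# Revised pattern from this answer.
REVISED = re.compile(
    r"""(<\s*?script[\s\S]*?(?:(?:src=['"](.*?)['"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)"""
)


def summarize(pattern):
    """Return (src, inline body) for every match of pattern in SAMPLE."""
    return [(m.group(2), m.group(3).strip()) for m in pattern.finditer(SAMPLE)]


for name, pattern in [("original", ORIGINAL), ("without \\B", WITHOUT_B), ("revised", REVISED)]:
    results = summarize(pattern)
    print(name, len(results), [src for src, _ in results])
```

On this sample the original pattern finds two tags, the version without \B finds all three but never fills the src group, and the revised pattern finds all three with the src path captured for the first tag.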

For giggles, here's a super-simple way that I figured out by complete accident (as a JS string to be fed to the RegExp constructor):

'src=(=|=")' + yourPathHere + '[^<]<\\/script>'

where yourPathHere has had forward slashes escaped; so, as a pure RE, something like:

/src=(=|=")\/scripts\/someFolder\/script.js[^<]<\/script>/

which I'm using in a gulp task whilst I'm trying to figure out gulp streams :[]
