I need to match different kinds of script tags, for example:
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>
I also need to capture the tags themselves and the src value in separate groups. I created this regex:
(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)
This does not match the last script tag.
What's wrong with it?
EDIT: Removing the \B does match all the script tags, but then I do not get the contents of the src attribute in a separate group. What I need to do is process a set of script tags of two categories (with and without a src attribute, as in the sample above).
In every case I need to remove the opening and closing script tags but keep the content inside the tag. If the tag is of the first type, I still need to remove the tags but keep the path in a separate table. Hope that clarifies it.
As iCodez' link so entertainingly shows, HTML should not be parsed by regex, as HTML is not a regular language. Instead, try using a parser such as BeautifulSoup. Make sure you also install lxml and html5lib for best performance and access to all the features.
pip install lxml html5lib beautifulsoup4
should do the trick.
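For illustration, here is a minimal sketch of the parser approach applied to the sample from the question. It deliberately uses Python's built-in html.parser module rather than BeautifulSoup so that it runs without installing anything; the class name ScriptCollector and its attributes are made up for this example. It collects the src paths in one list and the inline script bodies in another, which is roughly what the question asks for.

```python
from html.parser import HTMLParser

# Sample markup taken from the question.
html = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>
"""


class ScriptCollector(HTMLParser):
    """Hypothetical helper: gathers src paths and inline script bodies."""

    def __init__(self):
        super().__init__()
        self.src_paths = []       # values of src attributes
        self.inline_bodies = []   # text found between <script> and </script>
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buffer = []
            src = dict(attrs).get("src")
            if src is not None:
                self.src_paths.append(src)

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            body = "".join(self._buffer).strip()
            if body:
                self.inline_bodies.append(body)
            self._in_script = False


collector = ScriptCollector()
collector.feed(html)
collector.close()

print(collector.src_paths)
print(collector.inline_bodies)
```

With BeautifulSoup the same idea is usually shorter, but the structure (visit each script element, read its src attribute or its text) stays the same.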
Granted, I agree with all the remarks about not parsing HTML with RegExp, and granted, I myself indulge in such evil practice when I'm confident that the documents I will process are regular enough. That said, try removing the \B; in my test it then matches all three scripts.
What is this "non-boundary" for, by the way? I'm not sure I understood why you inserted it. If it was necessary for some reason I do not grasp, please tell me and we'll try to find another way.
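To make the effect of \B concrete, here is a small sketch using Python's re module (an assumption, since the question does not say which regex engine is used). \B only matches where there is no word boundary. In <script> the position between the "t" and the ">" is a word boundary, so \B fails there, the engine backtracks past the ">" into the script body, and the next ">" it can find is the one that ends </script>. After that there is no </script> left for the rest of the pattern, so the last tag is never matched.

```python
import re

# Sample markup taken from the question.
html = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>
"""

# The regex from the question, unchanged.
original = re.compile(
    r"""(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?\B[\S\s]*?>)([\s\S]*?)(</script>)"""
)
# The same regex with only the \B removed.
without_b = re.compile(
    r"""(<\s*?script[\s\S]*?(?:src=['"](\S+?)['"])?[\S\s]*?>)([\s\S]*?)(</script>)"""
)

original_matches = list(original.finditer(html))
without_b_matches = list(without_b.finditer(html))

# With \B: the last <script> tag is missed.
print(len(original_matches), [m.group(2) for m in original_matches])
# Without \B: every tag matches, but the optional src group is skipped each time.
print(len(without_b_matches), [m.group(2) for m in without_b_matches])
```

Without \B the lazy [\s\S]*? starts out matching nothing, the optional src group is tried at the space right after "script" and skipped, and the second lazy run then swallows the whole attribute list. That is why the src group stays empty in that variant.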
Edit: In order to retain the src content try
(<\s*?script[\s\S]*?(?:(?:src=['"](.*?)['"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)
This works for me; check it against your other samples. Consider that your first [\s\S]*? already matches everything up to the > when you do not have a "src" attribute, so the second one only makes sense if "src" is there and you want to match other possible attributes.
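As a rough check, the sketch below runs this regex over the sample from the question with Python's re module (the choice of Python and the names strip_tags, src_paths and stripped are assumptions made for the example). It strips the opening and closing tags, keeps the inline content, and stores any src path in a separate list.

```python
import re

# Sample markup taken from the question.
html = """<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function()
{
jQuery("#gift_cards").tooltip({ effect: 'slide'});
});
</script>
<script>dasdfsfsdf</script>
"""

# Group 1: opening tag, group 2: src value (if any),
# group 3: script content, group 4: closing tag.
SCRIPT_RE = re.compile(
    r"""(<\s*?script[\s\S]*?(?:(?:src=['"](.*?)['"])(?:[\S\s]*?))?>)([\s\S]*?)(</script>)"""
)

src_paths = []  # the "separate table" of src values


def strip_tags(match):
    """Record the src path, if present, and keep only the tag's content."""
    if match.group(2) is not None:
        src_paths.append(match.group(2))
    return match.group(3)


stripped = SCRIPT_RE.sub(strip_tags, html)

print(src_paths)
print(stripped)
```

This only holds for markup as regular as the sample; a ">" inside an attribute value or a "</script>" inside a string literal would still confuse it.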
For giggles, here's a super-simple way that I figured out by complete accident (as a JS string to be fed to the RegExp constructor):
'src=(=|=")' + yourPathHere + '[^<]<\\/script>'
where yourPathHere has had forward slashes escaped; so, as a pure RE, something like:
/src=(=|=")\/scripts\/someFolder\/script.js[^<]<\/script>/
which I'm using in a gulp task whilst I'm trying to figure out gulp streams :[]