I created a regex that grabs several pieces of data from an HTML page. It uses grouped alternatives within a non-capturing group. It works really well to grab the needed data; however, the groups are not combined into as few matches as possible.
While coding it up, I thought the matches and groups seemed a little weird in the online regex tester, but it wasn't until I got it working in Python that I noticed the issue with my group hierarchy.
My only solutions appear to be to...
Number 1 above would be the better of the ideas. I don't want to "pollute" my code base with unneeded code.
^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|.zip))\">(.*)</a>.*\((.*) bytes\))
Playground https://regex101.com/r/MC8TOv/1/
Webscraped Site https://android-dot-devsite-v2-prod.appspot.com/studio/archive_25350a46834ddb86754aba2445ff1359aa7fd8cb296923255092494ac94ef531.frame
While using BeautifulSoup
import re
...
regex = r"^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:.tar.gz|.zip))\">(?P<filename>.*)</a>.*\((.*) bytes\))"
match = re.findall(regex, str(soup_html), re.M)
print(match)
What I am getting
This is some generic output I am getting.
[
('A1', '', '', '', ''),
('', 'B1', '', '', ''),
('', '', 'C1', 'D1', 'E1'),
('A2', '', '', '', ''),
('', 'B2', '', '', ''),
('', '', 'C2', 'D2', 'E2'),
...
]
What I want
[
('A1', 'B1', 'C1', 'D1', 'E1'),
('A2', 'B2', 'C2', 'D2', 'E2'),
...
]
Again, is there a way to rewrite the regex to have 5 matched groups per match?
If the five things have to appear in a sequence as your example suggests, combine them with .*? rather than |:
(regex1).*?(regex2).*?(regex3).*(regex4).*?(regex5)
instead of
(regex1)|(regex2)|(regex3).*(regex4).*?(regex5)
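As a rough sketch of the sequential form in Python: the sample HTML, the example.invalid URLs, and the tightened sub-patterns (`[^<]*`, `[^"]*`, `\d+`) below are assumptions made for illustration, not the real page or the original regex. The point is only that joining the pieces with `.*?` yields one five-group tuple per record.

```python
import re

# Hypothetical stand-in for the scraped page: each record is a <p>, a <span>,
# and a download link followed by a size in bytes.
sample_html = """\
<p class="title">A1</p>
<span>B1</span>
<a href="https://example.invalid/c1-linux.tar.gz">D1</a> (111 bytes)
<p class="title">A2</p>
<span>B2</span>
<a href="https://example.invalid/c2-linux.zip">D2</a> (222 bytes)
"""

# The five pieces are chained with lazy `.*?` instead of being alternated with `|`.
pattern = re.compile(
    r'<p[^>]*>([^<]*)</p>'
    r'.*?<span>([^<]*)</span>'
    r'.*?<a href="([^"]*linux(?:\.tar\.gz|\.zip))">([^<]*)</a>'
    r'.*?\((\d+) bytes\)',
    re.DOTALL,  # let `.` cross newlines, since one record spans several lines
)

rows = pattern.findall(sample_html)
for row in rows:
    print(row)
```

Because a record spans several lines here, the sketch uses `re.DOTALL` and drops the `^.*` prefix and `re.M` from the question. Note that if some record lacks one of the five pieces, the lazy `.*?` will reach into the next record to find it, so this only fits when every record is complete.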
If, on the other hand, they don't necessarily have to appear in this order, then I don't quite see how you expect to shoehorn them into the rigid five-group structure.
Either way, there's also the possibility of post-processing the results after you've applied the regex.
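A minimal sketch of such post-processing, assuming the partial tuples arrive in record order and that the tuple filling the last group closes a record. The helper name `merge_partial_matches` is hypothetical, and the input is the generic output shown in the question.

```python
def merge_partial_matches(matches, width=5):
    """Fold consecutive partially filled tuples into full records."""
    merged = []
    current = [""] * width
    for match in matches:
        # Copy every non-empty group into its slot of the record being built.
        for index, value in enumerate(match):
            if value:
                current[index] = value
        # Assumption: the last group (the byte count) marks the end of a record.
        if current[-1]:
            merged.append(tuple(current))
            current = [""] * width
    return merged


partial = [
    ("A1", "", "", "", ""),
    ("", "B1", "", "", ""),
    ("", "", "C1", "D1", "E1"),
    ("A2", "", "", "", ""),
    ("", "B2", "", "", ""),
    ("", "", "C2", "D2", "E2"),
]

combined = merge_partial_matches(partial)
print(combined)
```

This keeps the original alternation regex untouched, at the cost of a few extra lines after the `re.findall` call.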