
Flatten regex group matches

Summary

I created a regex that grabs several pieces of data from an HTML page. It uses grouped alternatives within a non-capturing group. It works really well to grab the needed data; however, the groups are not combined into as few matches as possible.

While writing it, I thought the matches and groups looked a little odd in the online regex tester, but it wasn't until I got it working in Python that I noticed the issue with my group hierarchy.

My only solutions appear to be to...

  1. Rewrite the regex to have 5 matched groups per match
  2. Write Python code to flatten the data structure.
  3. Something else???

Option 1 would be the best of these ideas; I don't want to "pollute" my code base with unneeded code.

Code

Regex

^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|\.zip))\">(.*)</a>.*\((.*) bytes\))

Playground https://regex101.com/r/MC8TOv/1/

Webscraped Site https://android-dot-devsite-v2-prod.appspot.com/studio/archive_25350a46834ddb86754aba2445ff1359aa7fd8cb296923255092494ac94ef531.frame


Python

While using BeautifulSoup

import re

...

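# Same pattern as above, with the dots in the file extensions escaped
regex = r"^.*(?:<p.*>(.*)|<span>(.*)<\/span>|<a href=\"(.*linux(?:\.tar\.gz|\.zip))\">(?P<filename>.*)</a>.*\((.*) bytes\))"
# re.M makes ^ match at the start of every line
match = re.findall(regex, str(soup_html), re.M)

print(match)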

Expectations

What I am getting

This is a generic version of the output I am getting.

 [
     ('A1', '', '', '', ''),
     ('', 'B1', '', '', ''),
     ('', '', 'C1', 'D1', 'E1'),
     ('A2', '', '', '', ''),
     ('', 'B2', '', '', ''),
     ('', '', 'C2', 'D2', 'E2'),
     ...
 ]

What I want

 [
     ('A1', 'B1', 'C1', 'D1', 'E1'),
     ('A2', 'B2', 'C2', 'D2', 'E2')
     ...
 ]

Again, is there a way to rewrite the regex to have 5 matched groups per match?

If the five things have to appear in a sequence as your example suggests, combine them with .*? rather than | :

(regex1).*?(regex2).*?(regex3).*(regex4).*?(regex5)

instead of

(regex1)|(regex2)|(regex3).*(regex4).*?(regex5)
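As a rough illustration of the sequential form, here is a minimal Python sketch. The sample markup, the example.com URLs, and the exact sub-patterns are assumptions made up for the demonstration; the real page may be laid out differently. It uses re.S so that .*? can cross line breaks between the five pieces.

```python
import re

# Hypothetical sample markup; the real page's structure may differ.
html = """\
<p class="title">A1
<span>B1</span>
<a href="https://example.com/one-linux.tar.gz">D1</a> (111 bytes)
<p class="title">A2
<span>B2</span>
<a href="https://example.com/two-linux.zip">D2</a> (222 bytes)
"""

# The five groups are joined with lazy .*? instead of |, so a single
# match has to contain all five pieces in this order.
pattern = re.compile(
    r'<p[^>]*>([^<\n]*)'                               # group 1: text after <p ...>
    r'.*?<span>(.*?)</span>'                           # group 2: span text
    r'.*?<a href="([^"]*linux(?:\.tar\.gz|\.zip))">'   # group 3: download URL
    r'(.*?)</a>'                                       # group 4: link text
    r'.*?\(([^)]*) bytes\)',                           # group 5: size in bytes
    re.S,  # let . match newlines so .*? can span several lines
)

rows = pattern.findall(html)
for row in rows:
    print(row)
```

One caveat with this approach: if an entry is missing one of the five pieces, the lazy .*? will keep scanning into the next entry to find it, so the groups of two entries can get mixed.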

If, on the other hand, they don't necessarily have to appear in this order, then I don't quite see how you expect to shoehorn them into the rigid five-group structure.

Either way, there's also the possibility of post-processing the results after you've applied the regex.
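For the post-processing route, a minimal sketch could look like the following. The helper name flatten_matches and its width parameter are hypothetical, and it assumes every record eventually fills all five slots with non-empty strings; a record with a genuinely empty field would never be emitted.

```python
def flatten_matches(matches, width=5):
    """Merge consecutive sparse tuples from re.findall into full rows.

    Assumes each record fills every slot with a non-empty string
    (hypothetical helper, not part of the re module).
    """
    rows = []
    current = [''] * width
    for match in matches:
        # Copy whatever this partial match captured into the current row.
        for i, value in enumerate(match):
            if value:
                current[i] = value
        # Once every slot is filled, emit the row and start a new one.
        if all(current):
            rows.append(tuple(current))
            current = [''] * width
    return rows


# Generic data in the shape shown in the question.
sparse = [
    ('A1', '', '', '', ''),
    ('', 'B1', '', '', ''),
    ('', '', 'C1', 'D1', 'E1'),
    ('A2', '', '', '', ''),
    ('', 'B2', '', '', ''),
    ('', '', 'C2', 'D2', 'E2'),
]

flat = flatten_matches(sparse)
print(flat)
```

This keeps the original regex untouched, at the cost of a few extra lines of code after the findall call.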

