
How to combine a single regex group with multiple subsequent groups

I'm modifying an existing Python script that extracts text from HTML schedules using regex. The script works great except for one situation which looks like this (greatly simplified):

<tr>
   <td class="month">September</td>
   <td class="date">1</td>
   <td class="date">8</td>
   <td class="date">15</td>
</tr>

I want to return:

('September', '1'),
('September', '8'),
('September', '15'), 

...with a single regex. Writing a regex to capture the groups is trivial. I just can't figure out how to create the desired output with regex. I've tried multiple combinations of lookaround, backreferences, etc. I assume this is straightforward but just can't find the correct regex. Any help is appreciated.

Also, I am fully aware that using regex on HTML text is not the best approach but this legacy system works well and just needs to handle this one case.

Similarly, I know I could return the individual groups and easily create the tuples in Python. That kind of post-processing just doesn't fit well with the existing script.

Regex is not recommended for parsing HTML. There will always be more than one "special case" that trips up your expression. Even if the required output were possible with a single regular expression, the code would not be easy to maintain if the HTML changed at a later date.
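
To illustrate why a single pattern struggles here: in Python's built-in re module, a capture group that repeats keeps only its last match, so one match object cannot hand back all three dates alongside the month. The pattern below is only an illustrative sketch written against the sample markup (with the last cell closed by </td>), not the asker's actual script:

```python
import re

html = """<tr>
   <td class="month">September</td>
   <td class="date">1</td>
   <td class="date">8</td>
   <td class="date">15</td>
</tr>"""

# Illustrative pattern: one month cell followed by one or more date cells.
pattern = re.compile(
    r'<td class="month">(\w+)</td>(?:\s*<td class="date">(\d+)</td>)+'
)

match = pattern.search(html)

# The repeated date group retains only its final capture.
print(match.groups())  # ('September', '15')
```

The dates 1 and 8 are matched but then overwritten, which is why the month cannot simply be paired with each date in one pass.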

The normal approach to such a problem is to use BeautifulSoup. For the HTML you have provided, this could be done as follows:

from bs4 import BeautifulSoup

html = """<tr>
   <td class="month">September</td>
   <td class="date">1</td>
   <td class="date">8</td>
   <td class="date">15</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")

month = soup.find('td', class_='month').text
dates = [(month, date.text) for date in soup.find_all('td', class_='date')]

print(dates)

This would display:

[('September', '1'), ('September', '8'), ('September', '15')]    
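
If adding BeautifulSoup to the legacy script is not an option, a regex-only fallback is possible with two patterns rather than one. This is a rough sketch that assumes every row looks exactly like the sample (a month cell followed by date cells, closed by </tr>); the names ROW_RE and DATE_RE are made up for illustration:

```python
import re

html = """<tr>
   <td class="month">September</td>
   <td class="date">1</td>
   <td class="date">8</td>
   <td class="date">15</td>
</tr>"""

# Hypothetical helper patterns, assuming the markup matches the sample.
# ROW_RE captures the month and the remainder of the row up to </tr>.
ROW_RE = re.compile(r'<td class="month">(.*?)</td>(.*?)</tr>', re.DOTALL)
# DATE_RE captures the text of each date cell.
DATE_RE = re.compile(r'<td class="date">(.*?)</td>')

# Pair the month of each row with every date found in that same row.
pairs = [
    (month, date)
    for month, rest in ROW_RE.findall(html)
    for date in DATE_RE.findall(rest)
]

print(pairs)  # [('September', '1'), ('September', '8'), ('September', '15')]
```

This still involves a small amount of post-processing in Python, and it will break on markup that deviates from the sample, so the parser-based version above remains the more robust choice.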

