How to Apply a Regular Expression to BeautifulSoup's find_all() in Python

I am trying to scrape a website with multiple pages. Each page has several <table> tags whose ids range from table19 to table29, and the number of tables on each page varies.

Here is an example:

page 1 HTML

<table id='table20'>...</table>
<table id='table25'>...</table>

page 2 HTML

<table id='table19'>...</table>
<table id='table21'>...</table>
<table id='table29'>...</table>

page 3 HTML

<table id='table19'>...</table>
<table id='table20'>...</table>
<table id='table21'>...</table>

....

page n HTML

<table id='table19'>...</table>

I am trying to isolate these tables from the HTML pages in order to scrape them. So far I am able to loop through each page, but the regex I wrote to extract the tables from each page doesn't seem to work. Please help me.

Here is my code:

tables = soup.find_all('table', id = re.compile('^table\d(19|2[0-9])'))

You can use the regular expression r'table[12]\d' ( regex101 ). Note that it matches any id containing table10 through table29:

data = '''<table id='table19'><tr></tr></table>
<table id='table20'><tr></tr></table>
<table id='table21'><tr></tr></table>

<table id='table40'><tr></tr></table>'''

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(data, 'html.parser')

for table in soup.find_all('table', {'id':re.compile(r'table[12]\d')}):
    print(table)

Prints:

<table id="table19"><tr></tr></table>
<table id="table20"><tr></tr></table>
<table id="table21"><tr></tr></table>
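Since the question involves several pages, the same filter can be applied in a loop. This is only a sketch: the pages dictionary and its HTML strings are hypothetical stand-ins for whatever the page-fetching loop actually returns, and the pattern is the one shown above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page sources standing in for the downloaded pages
pages = {
    'page 1': "<table id='table20'></table><table id='table25'></table>",
    'page 2': "<table id='table19'></table><table id='table40'></table>",
}

# Compile the pattern once and reuse it for every page
id_pattern = re.compile(r'table[12]\d')

found = {}
for name, html in pages.items():
    soup = BeautifulSoup(html, 'html.parser')
    # Collect the ids of the matching tables on this page
    found[name] = [t['id'] for t in soup.find_all('table', id=id_pattern)]

print(found)
```

Pages with no matching table simply yield an empty list, so the varying number of tables per page needs no special handling.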

EDIT: To match only table19 or table20 through table29, use a non-capturing group ( regex101 ):

for table in soup.find_all('table', {'id':re.compile(r'table(?:19|2\d)')}):
    print(table)
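To see why the pattern from the question finds nothing, and how the patterns above differ, they can be compared with the re module alone. BeautifulSoup applies a compiled pattern with search(), so an unanchored pattern matches anywhere in the id. The id list and the final anchored variant are assumptions added for illustration, not part of the original answer.

```python
import re

# Sample ids; table10, table30 and table190 are hypothetical edge cases
ids = ['table10', 'table19', 'table20', 'table29', 'table30', 'table190']

patterns = {
    'question': r'^table\d(19|2[0-9])',  # extra \d demands a digit before 19 or 2x
    'broad': r'table[12]\d',             # also accepts table10 through table18
    'grouped': r'table(?:19|2\d)',       # 19 or 20-29, but unanchored
    'anchored': r'^table(?:19|2\d)$',    # assumed stricter variant: whole id must match
}

# search() mirrors how BeautifulSoup tests an attribute against a regex
matches = {
    name: [i for i in ids if re.search(pat, i)]
    for name, pat in patterns.items()
}

for name, found in matches.items():
    print(name, found)
```

The question's pattern would only match ids such as table119 or table125, which is why find_all() came back empty. If ids like table190 can occur on the site, the anchored form is probably the safer choice.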

If that id prefix is unique to the tables of interest, could you not use a CSS attribute selector with the starts-with operator (^=)?

for table in soup.select('table[id^=table]'):
    print(table)  # do something with table
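If the prefix is not unique, the selector can still be combined with a numeric check on the rest of the id. The following is a rough sketch; the in_range helper, its default bounds, and the sample HTML are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

data = '''<table id='table19'></table>
<table id='table25'></table>
<table id='table40'></table>
<table id='summary'></table>'''

soup = BeautifulSoup(data, 'html.parser')

def in_range(table, low=19, high=29):
    """Hypothetical helper: True if the id is 'table' plus a number in [low, high]."""
    suffix = table['id'][len('table'):]
    return suffix.isdigit() and low <= int(suffix) <= high

# The CSS selector narrows by prefix; the helper narrows by number
wanted = [t['id'] for t in soup.select('table[id^=table]') if in_range(t)]
print(wanted)
```

Here the selector already excludes the table with id summary, and the helper drops table40.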
