
re.match does not restrict URLs

I would like to get only those school URLs in the table on this wiki page that lead to a page with information. The bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I only return those which do not contain the aforementioned string. Why isn't re.match() working?

URL:

districts_page = 'https://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'

FUNCTION:

import re
import requests
from bs4 import BeautifulSoup

def url_check(url):

    all_urls = []

    r = requests.get(url, proxies=proxies)
    html_source = r.text
    soup = BeautifulSoup(html_source, 'html.parser')

    for link in soup.find_all('a'):
        if type(link.get('title')) == str:
            if re.match(link.get('title'), '(page does not exist)') == None: 
                all_urls.append(link.get('href'))
            else: pass

    return all_urls

This does not address fixing the problem with re.match, but may be a valid approach for you without using a regex:

    for link in soup.find_all('a'):
        title = link.get('title')
        if title:
            if 'page does not exist' not in title:
                all_urls.append(link.get('href'))

The order of the arguments to re.match should be the pattern then the string. So try:

    if not re.search(r'(page does not exist)', link.get('title')): 

(I've also changed re.match to re.search since -- as @goldisfine observed -- the pattern does not occur at the beginning of the string.)
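The difference is easy to demonstrate on a title string like the ones on the wiki page (the school name here is made up):

```python
import re

title = "Foobar High School (page does not exist)"

# re.match anchors the pattern at the start of the string,
# so a phrase appearing later in the title is never found:
print(re.match(r'page does not exist', title))   # None

# re.search scans the whole string and finds the phrase wherever it occurs:
print(re.search(r'page does not exist', title))  # a Match object
```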


Using @kindall's observation, your code could also be simplified to

for link in soup.find_all('a', 
        title=lambda x: x is not None and 'page does not exist' not in x):
    all_urls.append(link.get('href'))

This eliminates the two if-statements; it can all be incorporated into the call to soup.find_all.
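On a small made-up snippet mimicking the wiki table's good and bad links, the callable passed as the title keyword receives each tag's title attribute value (None when the attribute is absent):

```python
from bs4 import BeautifulSoup

# Hypothetical links: one good, one red "missing page" link, one with no title
html = '''
<a href="/wiki/Good_School" title="Good School">Good School</a>
<a href="/wiki/Bad_School" title="Bad School (page does not exist)">Bad School</a>
<a href="#top">no title at all</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all keeps only the tags for which the filter function returns True
links = soup.find_all(
    'a', title=lambda x: x is not None and 'page does not exist' not in x)

print([link.get('href') for link in links])  # ['/wiki/Good_School']
```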

unutbu's answer addresses the syntax error, but fixing the argument order alone is not enough. re.match() only looks at the beginning of the string, while re.search() scans the entire string until it finds a section that matches the pattern.

The following code works:

for link in soup.find_all('a'):
    if type(link.get('title')) == str:
        if re.search('page does not exist', link.get('title')) is None: 
            all_urls.append(link.get('href'))
return all_urls
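Putting the fixes together, the filtering logic can be pulled into a helper that takes the fetched HTML, so it can be tried without a network call (filter_urls is a name chosen here, not from the original post):

```python
import re
from bs4 import BeautifulSoup

def filter_urls(html_source):
    """Return hrefs of links whose title lacks 'page does not exist'."""
    soup = BeautifulSoup(html_source, 'html.parser')
    all_urls = []
    for link in soup.find_all('a'):
        title = link.get('title')
        # re.search scans the whole title; re.match would only test its start.
        if isinstance(title, str) and re.search(r'page does not exist', title) is None:
            all_urls.append(link.get('href'))
    return all_urls

# With the question's setup this would be used as:
#   r = requests.get(districts_page, proxies=proxies)
#   good_urls = filter_urls(r.text)
```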
