re.match does not restrict URLs

Question

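From the table on this wiki page, I want to get only the school URLs that lead to pages containing information. The bad URLs are shown in red and contain the phrase "page does not exist" in their 'title' attribute. I am trying to use re.match() to filter the URLs so that I return only those that do not contain that string. Why isn't re.match() working?

URL: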

districts_page = 'https://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'

Function:

def url_check(url):

    all_urls = []

    r = requests.get(url, proxies = proxies)
    html_source = r.text
    soup = BeautifulSoup(html_source)

    for link in soup.find_all('a'):
        if type(link.get('title')) == str:
            if re.match(link.get('title'), '(page does not exist)') == None: 
                all_urls.append(link.get('href'))
            else: pass

    return

Answer 1

This does not fix the re.match approach itself, but it may be a valid alternative that avoids regular expressions:

  for link in soup.find_all('a'):
    title = link.get('title')
    if title:
      if not 'page does not exist' in title: 
        all_urls.append(link.get('href'))

Answer 2

The arguments to re.match should be the pattern first, then the string. So try:

    if not re.search(r'(page does not exist)', link.get('title')):

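(I also changed re.match to re.search because, as @goldisfine observed, the pattern does not occur at the beginning of the string.)

Using @kindall's observation, your code can also be simplified to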
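
To see why the argument order matters, here is a small sketch that uses only the standard `re` module. The sample title is made up for illustration. It also shows a side detail: unescaped parentheses in a pattern form a group rather than matching literal parentheses, so `r'(page does not exist)'` matches the phrase with or without parentheses around it.

```python
import re

# Hypothetical title attribute, modelled on a red link's tooltip.
title = "Some School District (page does not exist)"

# Question's order: the title is compiled as the pattern and the phrase
# becomes the string being searched, so nothing matches.
swapped = re.search(title, "(page does not exist)")

# Documented order: pattern first, then the string to search.
correct = re.search(r"(page does not exist)", title)

# Unescaped parentheses are a capture group, so this also matches
# a string that contains no parentheses at all.
no_parens = re.search(r"(page does not exist)", "page does not exist")

# re.escape makes the parentheses literal, so the same string no longer matches.
literal = re.search(re.escape("(page does not exist)"), "page does not exist")

print(swapped is None)        # True
print(correct.group(1))       # page does not exist
print(no_parens is not None)  # True
print(literal is None)        # True
```

Passing the title as the pattern can also raise `re.error` when a title happens to contain unbalanced regex metacharacters, which is another reason to keep the order straight.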

for link in soup.find_all('a', 
        title=lambda x: x is not None and 'page does not exist' not in x):
    all_urls.append(link.get('href'))

This eliminates the two if-statements. Everything can be folded into the call to soup.find_all.
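
The filter passed as `title=` is an ordinary function, so its logic can be tried without fetching a page. The sketch below assumes, as the `x is not None` check in the lambda implies, that the function receives `None` for links that have no `title` attribute. The name `is_live_title` and the sample values are made up for illustration.

```python
from typing import Optional


def is_live_title(title: Optional[str]) -> bool:
    """Same test as the lambda: a title exists and is not a red link."""
    return title is not None and "page does not exist" not in title


# Hypothetical title values, one per <a> tag.
samples = [
    "Example County School System",
    "Example City Schools (page does not exist)",
    None,  # link with no title attribute
]

kept = [t for t in samples if is_live_title(t)]
print(kept)  # ['Example County School System']
```

A named function like this could be passed in place of the lambda, which may be easier to test and reuse.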

Answer 3

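Unutbu's answer fixes the error in the call (the argument order). But simply using re.match() is not enough: re.match only looks at the beginning of the string, whereas re.search() scans through the whole string until it finds a part that matches the pattern.

The following code works: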

for link in soup.find_all('a'):
    if type(link.get('title')) == str:
        if re.search('page does not exist',link.get('title')) == None: 
            all_urls.append(link.get('href'))
return all_urls
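
The difference between the two functions is easy to check in isolation. This is a minimal sketch with a made-up title string, not part of the original function.

```python
import re

pattern = "page does not exist"
# Hypothetical title: the phrase sits at the end, not at the start.
title = "Example City Schools (page does not exist)"

at_start = re.match(pattern, title)   # only tries to match at position 0
anywhere = re.search(pattern, title)  # scans forward for the first match

print(at_start)         # None
print(anywhere.span())  # start and end index of the phrase in the title

# re.match does succeed when the string begins with the pattern.
starts_with = re.match(pattern, "page does not exist yet")
print(starts_with is not None)  # True
```

Because the phrase is always at the end of a red link's title, re.match never finds it there, so every link passes the filter.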

3 solutions

Solution 1: score 2, accepted, 2013-08-12 18:18:05
Solution 2: score 0, 2013-08-12 17:46:42
Solution 3: score 0, 2013-08-12 17:53:06
