
Re.match does not restrict urls

I would like to get only those school URLs in the table on this wiki page that lead to a page with information. The bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I only return those which do not contain the aforementioned string. Why isn't re.match() working?

URL:

districts_page = 'https://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'

FUNCTION:

def url_check(url):

    all_urls = []

    r = requests.get(url, proxies = proxies)
    html_source = r.text
    soup = BeautifulSoup(html_source)

    for link in soup.find_all('a'):
        if type(link.get('title')) == str:
            if re.match(link.get('title'), '(page does not exist)') == None: 
                all_urls.append(link.get('href'))
            else: pass

    return 

This does not fix the problem with re.match, but it may be a valid approach for you without using regex:

  for link in soup.find_all('a'):
    title = link.get('title')
    if title:
      if 'page does not exist' not in title:
        all_urls.append(link.get('href'))

The order of the arguments to re.match should be the pattern, then the string. So try:

    if not re.search(r'(page does not exist)', link.get('title')): 

(I've also changed re.match to re.search since, as @goldisfine observed, the pattern does not occur at the beginning of the string.)
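To see why the original call let every link through, here is a minimal sketch. The sample title is an assumption modeled on how red-link titles typically look, not a value copied from the page. With the arguments swapped, the title is treated as the pattern and tested against the fixed text, so the result is None and the link is kept.

```python
import re

# Hypothetical title of a red link (assumed format, for illustration only).
title = "Example City Schools (page does not exist)"

# Swapped arguments: the title is used as the PATTERN and the literal
# text as the STRING, so nothing matches for this title.
swapped = re.match(title, '(page does not exist)')

# Correct order: pattern first, then the string to search in.
# The parentheses are escaped here so they match literally
# instead of forming a capture group.
correct = re.search(r'\(page does not exist\)', title)

print(swapped)              # None
print(correct is not None)  # True
```

Leaving the parentheses unescaped, as in the line above, also works for this filter, because the group then simply matches the text 'page does not exist' anywhere in the title.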


Using @kindall's observation, your code could also be simplified to:

for link in soup.find_all('a', 
        title=lambda x: x is not None and 'page does not exist' not in x):
    all_urls.append(link.get('href'))

This eliminates the two if-statements. It can all be incorporated into the call to soup.find_all.
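If the lambda feels cramped, the same test could be pulled out into a named function, which also makes it easy to check without fetching the page. This is only a sketch: the name links_to_existing_page and the sample titles are hypothetical, and None stands in for an anchor tag that has no title attribute.

```python
def links_to_existing_page(title):
    """Return True if a title attribute is present and is not a red link."""
    return title is not None and 'page does not exist' not in title


# Hypothetical title values; None represents an <a> tag without a title.
sample_titles = [
    "Example County School District",
    "Example City Schools (page does not exist)",
    None,
]

kept = [t for t in sample_titles if links_to_existing_page(t)]
print(kept)  # ['Example County School District']
```

The function would then be passed in place of the lambda, as in soup.find_all('a', title=links_to_existing_page).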

Unutbu's answer addresses the argument-order error. But simply using re.match() is not enough. re.match() only looks for a match at the beginning of the string. re.search() goes through the entire string until it finds a section that matches the given pattern.

The following code works:

for link in soup.find_all('a'):
    title = link.get('title')
    if isinstance(title, str):
        if re.search('page does not exist', title) is None:
            all_urls.append(link.get('href'))
return all_urls
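The difference between the two functions can be checked in isolation. The snippet below is a small sketch using a made-up title string (an assumption, not taken from the page) where the phrase sits at the end rather than the start.

```python
import re

# Hypothetical red-link title (assumed format, for illustration only).
title = "Example City Schools (page does not exist)"
pattern = 'page does not exist'

# re.match only attempts a match at position 0 of the string,
# so a phrase that appears later is not found.
at_start = re.match(pattern, title)
print(at_start)  # None

# re.search scans the string and returns the first match anywhere.
anywhere = re.search(pattern, title)
print(anywhere.group())  # page does not exist

# re.match does succeed when the string begins with the pattern.
print(re.match(pattern, 'page does not exist yet') is not None)  # True
```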
