
Having trouble parsing through page numbers in Python

I'm trying to build my first web scraper, but I can't figure out how to stop my program from looking for "next-page" links.

#get URLs for all pages
def page_parse(main_url, url_list):
    page = requests.get(main_url);
    soup = BeautifulSoup(page.content, 'html.parser');
    #check if next page button inactive
    if soup.find('a.next.ajax-page', href=True) == None:
        print('debug');
        return url_list;
    next_page = soup.select_one('a.next.ajax-page', href=True)['href']
    next_page = (f'http://www.yellowpages.com{next_page}')
    url_list.append(next_page);
    print(str(url_list))
    page_parse(next_page, url_list);
    return url_list;

I know what the error is; I just have no idea how to check whether the "next page" button is active. I've tried looking for differences in the HTML between the first and last pages' "next page" buttons (the first page uses a.next.ajax-page, while the last uses div.next). Depending on what I change, my code either hits the print('debug') or gets to the last page and hits a TypeError (see below). I think the issue is not being able to check whether an element exists without calling it.

Error output:

['http://www.yellowpages.com/omaha-ne/towing?page=2']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5', 'http://www.yellowpages.com/omaha-ne/towing?page=6']
['http://www.yellowpages.com/omaha-ne/towing?page=2', 'http://www.yellowpages.com/omaha-ne/towing?page=3', 'http://www.yellowpages.com/omaha-ne/towing?page=4', 'http://www.yellowpages.com/omaha-ne/towing?page=5', 'http://www.yellowpages.com/omaha-ne/towing?page=6', 'http://www.yellowpages.com/omaha-ne/towing?page=7']
Traceback (most recent call last):
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 49, in <module>  
    url_list = page_parse(main_url, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 19, in page_parse
    page_parse(next_page, url_list);
  [Previous line repeated 3 more times]
  File "c:\Users\-\Documents\code\Python Projects\webscrape2.py", line 15, in page_parse
    next_page = soup.select_one('a.next.ajax-page', href=True)['href']
TypeError: 'NoneType' object is not subscriptable

Sorry if this is confusing; this is my first time posting a question.

The problem here is that you are trying to subscript a None value. On the last page, soup.select_one('a.next.ajax-page', href=True) finds nothing and returns None, so you can't access ['href'] on the result.
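A minimal sketch of that guard, run against inline HTML rather than the live site. The markup below is a hypothetical imitation of the last results page (the question only says it uses div.next), so treat it as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the last results page, where the "next"
# control is a <div> instead of a link (assumed from the question's description).
last_page_html = '<div class="pagination"><div class="next">Next</div></div>'
soup = BeautifulSoup(last_page_html, 'html.parser')

# select_one() returns None when nothing matches, so test before subscripting.
next_link = soup.select_one('a.next.ajax-page')
if next_link is None:
    next_url = None  # no further page to visit
else:
    next_url = f"http://www.yellowpages.com{next_link['href']}"

print(next_url)
```

Checking the result first avoids the TypeError, because ['href'] is only evaluated when a tag was actually found.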

What happens?

Your check soup.find('a.next.ajax-page', href=True) never finds the element you are searching for, because it mixes two syntaxes: find() expects a tag name, not a CSS selector, so it looks for a tag literally named a.next.ajax-page and always returns None. As a result, the code also never gets to read the attribute value.
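The difference can be seen on a one-line document. This is only an illustrative sketch: the anchor tag below is made-up markup modeled on the selector from the question, not copied from the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup modeled on the question's "next page" link.
html = '<a class="next ajax-page" href="/omaha-ne/towing?page=2">Next</a>'
soup = BeautifulSoup(html, 'html.parser')

# find() treats its first argument as a tag name, so this looks for a tag
# literally called "a.next.ajax-page" and finds nothing.
mixed = soup.find('a.next.ajax-page', href=True)

# find() with a tag name plus attribute filters.
by_find = soup.find('a', class_='next', href=True)

# select_one() with a CSS selector; [href] requires the attribute to exist.
by_select = soup.select_one('a.next.ajax-page[href]')

print(mixed)
print(by_find['href'])
print(by_select['href'])
```

Here only the first call comes back empty; the other two return the same link.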

How to fix?

Change the line that checks for the next-page element from:

if soup.find('a.next.ajax-page', href=True) == None:

to:

if soup.find('a',{'class':'next ajax-page'}) is None:

or

if soup.select_one('a.next.ajax-page') is None:
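Applied to the URL-collecting function from the question, the fix might look like the sketch below. It uses a loop instead of recursion and takes the page downloader as a parameter so it can be tried offline. The names collect_page_urls, fetch_html, and fake_pages are hypothetical, and the fake pages are invented markup, not the real site's HTML.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.yellowpages.com'


def collect_page_urls(main_url, fetch_html):
    """Follow "next page" links from main_url and return the URLs found.

    fetch_html is any callable that takes a URL and returns HTML text
    (hypothetical parameter, added so the sketch can run without a network).
    """
    url_list = []
    url = main_url
    while True:
        soup = BeautifulSoup(fetch_html(url), 'html.parser')
        next_link = soup.select_one('a.next.ajax-page')
        if next_link is None:
            # Last page: no active "next" link, so stop here.
            return url_list
        url = urljoin(BASE_URL, next_link['href'])
        url_list.append(url)


# Invented two-page site used only to exercise the function offline.
start = f'{BASE_URL}/omaha-ne/towing'
fake_pages = {
    start: '<a class="next ajax-page" href="/omaha-ne/towing?page=2">Next</a>',
    f'{start}?page=2': '<div class="next">Next</div>',
}

result = collect_page_urls(start, fake_pages.__getitem__)
print(result)  # ['http://www.yellowpages.com/omaha-ne/towing?page=2']
```

Against the live site, something like collect_page_urls(start, lambda u: requests.get(u).text) should behave the same way, assuming the page still uses the a.next.ajax-page class.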

You should also be able to scrape the basic information from the search results and store it in one step, instead of returning a list of URLs for the search pages:

def page_parse(url):
    data = []
    while True:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for item in soup.select('div.result'):
            data.append({
                'title':item.h2.text,
                'url':f"{baseUrl}{item.a['href']}"
            })

        if (url := soup.select_one('a.next.ajax-page')):
            url = f"{baseUrl}{url['href']}"
        else:
            return data

Example

import requests
from bs4 import BeautifulSoup

baseUrl = 'http://www.yellowpages.com'

def page_parse(url):
    data = []
    while True:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for item in soup.select('div.result'):
            data.append({
                'title':item.h2.text,
                'url':f"{baseUrl}{item.a['href']}"
            })

        if (url := soup.select_one('a.next.ajax-page')):
            url = f"{baseUrl}{url['href']}"
        else:
            return data

page_parse('http://www.yellowpages.com/omaha-ne/towing')

Output

[{'title': "1. Keith's BP",
  'url': 'http://www.yellowpages.com/omaha-ne/mip/keiths-bp-460502890?lid=1002059325385'},
 {'title': '2. Neff Towing Svc',
  'url': 'http://www.yellowpages.com/omaha-ne/mip/neff-towing-svc-21969600?lid=1000282974083#gallery'},
 {'title': '3. A & A Towing',
  'url': 'http://www.yellowpages.com/omaha-ne/mip/a-a-towing-505777665?lid=1002056319136'},
 {'title': '4. Cross Electronic Recycling',
  'url': 'http://www.yellowpages.com/omaha-ne/mip/cross-electronic-recycling-473693798?lid=1000236876513'},
 {'title': '5. 24 Hour Towing',
  'url': 'http://www.yellowpages.com/omaha-ne/mip/24-hour-towing-521607477?lid=1001918028003'},
 {'title': '6. A & A Towing Fast Friendly',
  'url': 'http://www.yellowpages.com/omaha-ne/mip/a-a-towing-fast-friendly-478453697?lid=1000090213043'},
 {'title': '7. Austin David Towing',
  'url': 'http://www.yellowpages.com/omaha-ne/mip/austin-david-towing-465037110?lid=1001788338357'},...]
