How do I move to a new page when web scraping with BeautifulSoup?

Question

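Below is my code that pulls records from Craigslist. Everything works fine, but I need to be able to move to the next set of records and repeat the same process, and I'm new to programming. From looking at the page source, it looks like I should follow the arrow button contained in this span until it no longer has an href: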

<a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a>

I was thinking this could be a loop, but I suppose it could also be a try/except situation. Does that sound right? How would you implement it?

import requests
from urllib.request import urlopen
import pandas as pd

response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")

soup = BeautifulSoup(response.text,"lxml")

listings = soup.find_all('li', class_= "result-row")

base_url = 'https://nh.craigslist.org/d/computer-parts/search/'

next_url = soup.find_all('a', class_= "button next")


dates = []
titles = []
prices = []
hoods = []

while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)

        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})

 #write to a file
listings_df.to_csv("craigslist_listings.csv")

Answer 1

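For each page you crawl, you can find the next URL to crawl and add it to a list.

This is how I would do it without changing your code too much. I added some comments so you can see what is happening, but leave me a comment if you need any further explanation: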

import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup


base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'
urls = []
urls.append(base_url)
dates = []
titles = []
prices = []
hoods = []

while len(urls) > 0: # while we have urls to crawl
    print(urls)
    url = urls.pop(0) # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    next_url = soup.find('a', class_= "button next") # first "next" link on the page, or None if there is none
    if next_url and next_url.get('href'): # per the question, the button on the last page has no href
        urls.append(base_search_url + next_url['href']) # queue the next page to crawl

    listings = soup.find_all('li', class_= "result-row") # get all current url listings
    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)

        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})

 #write to a file
listings_df.to_csv("craigslist_listings.csv")

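Edit: You also forgot to import BeautifulSoup in your code; I added the import in my answer.
Edit 2: You only need to find the first instance of the next button, because the page can have (and in this case does have) more than one next button.
Edit 3: To crawl these listings, change base_url to the one used in this code.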
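
To illustrate Edit 2, here is a minimal sketch of why find() is enough where find_all() returns duplicates. The SAMPLE_HTML markup and the next_page_url helper are made up for this example and are not taken from the real page. It uses urljoin instead of string concatenation to build the absolute URL, and it uses the built-in "html.parser" so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup: the same "next" link appears
# above and below the results, as described in Edit 2.
SAMPLE_HTML = """
<div class="paginator">
  <a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a>
</div>
<ul><li class="result-row">...</li></ul>
<div class="paginator">
  <a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a>
</div>
"""


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns only the first match, or None when nothing matches.
    link = soup.find("a", class_="button next")
    if link is None or not link.get("href"):
        return None
    # urljoin resolves the relative href against the page it came from.
    return urljoin(current_url, link["href"])


current = "https://nh.craigslist.org/d/computer-parts/search/syp"
all_links = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("a", class_="button next")
print(len(all_links))  # 2, both pointing at the same page
print(next_page_url(SAMPLE_HTML, current))
```

Returning None for a missing link or a missing href lets the calling loop stop without needing a try/except.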

Answer 2

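This is not a direct answer to how to access the "next" button, but it may be a solution to your problem. When I have done web scraping in the past, I used the URL of each page to loop through the search results. On Craigslist, the URL changes when you click "next page", and there is usually a pattern to that change that you can take advantage of. I didn't look for long, but the second page appears to be https://nh.craigslist.org/search/syp?s=120 and the third https://nh.craigslist.org/search/syp?s=240. The final part of the URL seems to increase by 120 each time. You could create a list of multiples of 120, then build a for loop that adds each value to the end of the URL, and nest your current for loop inside that one.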
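
The offset idea above could be sketched roughly as follows. The page_urls helper and its max_pages argument are hypothetical: the real number of result pages is not known in advance, so the count here is an assumption you would adjust or replace with a stop condition.

```python
def page_urls(base_url, page_size=120, max_pages=5):
    """Build one URL per results page by stepping the s= offset.

    max_pages is an assumed upper bound, not something read from the site.
    """
    urls = []
    for offset in range(0, page_size * max_pages, page_size):
        if offset == 0:
            urls.append(base_url)  # the first page needs no offset
        else:
            urls.append(f"{base_url}?s={offset}")
    return urls


urls = page_urls("https://nh.craigslist.org/search/syp", max_pages=3)
for url in urls:
    # The existing requests.get / BeautifulSoup / listing loop would go here.
    print(url)
```

Compared with following the next button, this approach is simpler but cannot tell when the results run out, so a check for an empty list of listings on a page is a sensible way to break out early.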

Answer 1 (score 2, accepted): 2018-10-23 14:30:38
Answer 2 (score 1): 2018-10-23 14:21:34
