Checking whether a next page exists with BeautifulSoup

Question

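I am currently learning to write a scraper with BeautifulSoup. So far the code below works fine, apart from a few problems. First, some background: I am scraping player data from the Fold.it project. Because there are multiple pages to scrape, I have been using this block of code at the end of the loop to find the next page.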

   next_link = soup.find(class_='active', title='Go to next page')
   url_next = "http://www.fold.it" + next_link['href'] ### problem line???
   print url_next

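Unfortunately, sometimes I get a result like this:

As far as I can tell, for some reason the next-page link is not being parsed. I am not sure whether that is because of this particular website, the code I wrote, or something else entirely. So far I have tried adding code to check whether it returns NoneType, but it still errors out.

The ideal behavior I am looking for is to scrape through to the last page. If an error does occur, though, the same page should be retried. Any ideas, comments, or obvious mistakes I have made would be greatly appreciated!

The full code is below: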

import os
import urllib2
import csv
import time
from bs4 import BeautifulSoup

url_next = 'http://www.fold.it/portal/players/s_all'
url_last = ''

today_string = time.strftime('%m_%d_%Y')
location = '/home/' + 'daily_soloist_' + today_string + '.csv'

mode = 'a' if os.path.exists(location) else 'w'
with open(location, mode) as my_csv:
    while True:
        soup = BeautifulSoup(urllib2.urlopen(url_next).read(), "lxml")
        if url_next == url_last:
            print "Scraping Complete"
            break

        for row in soup('tr', {'class':'even'}):
            cells = row('td')

            #current rank
            rank = cells[0].text

            #finds first text node - user name
            name = cells[1].a.find(text=True).strip()

            #separates ranking
            rank1, rank2 = cells[1].find_all("span")

            #total global score
            score = row('td')[2].string

            data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'), int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]

            #writes to csv
            database = csv.writer(my_csv, delimiter=',')
            database.writerows(data)

        next_link = soup.find(class_='active', title='Go to next page')
        url_next = "http://www.fold.it" + next_link['href'] ### problem line???
        print url_next

        last_link = soup.find(class_='active', title = 'Go to last page')
        url_last = "http://www.fold.it" + last_link['href']

Answer 1

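To fix this, you can wrap the loop body in a try/except block like the one below. (You should add more error handling than I have.) If the try fails, url_next is left unchanged, so the same page is fetched again. Be careful, though: if the same page keeps failing, you will end up in an infinite loop.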

try:
    if url_next == url_last:
        print "Scraping Complete"
        break

    for row in soup('tr', {'class':'even'}):
        cells = row('td')

        #current rank
        rank = cells[0].text

        #finds first text node - user name
        name = cells[1].a.find(text=True).strip()

        #separates ranking
        rank1, rank2 = cells[1].find_all("span")

        #total global score
        score = row('td')[2].string

        data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'), int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]

        #writes to csv
        database = csv.writer(my_csv, delimiter=',')
        database.writerows(data)  


    next_link = soup.find(class_='active', title='Go to next page')
    url_next = "http://www.fold.it" + next_link['href'] ### problem line???

except:  #if the above bombs out, maintain the same url_next
    print "problem with this page, try again"

print url_next
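
The two concerns above (the failing lookup and the risk of an infinite loop) can be sketched together as follows. This is only an illustration under stated assumptions, not the answerer's method, and it is written for Python 3 rather than the Python 2 used above. `soup.find(...)` returns `None` when nothing matches, and indexing `None` with `['href']` is what fails on the "problem line". The names `next_page_url`, `crawl`, `fetch_soup`, `handle_page`, and `max_attempts` are hypothetical, introduced here for the example. The sketch also assumes that a missing "Go to next page" link means the last page has been reached; if the site sometimes serves incomplete pages, that case may deserve a retry as well.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.fold.it"  # site root taken from the question's code


def next_page_url(soup, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None if no link is found.

    `soup` is any object with a BeautifulSoup-style find() method.
    """
    link = soup.find(class_='active', title='Go to next page')
    if link is None:
        # find() returns None when nothing matches; check before indexing.
        return None
    return urljoin(base_url, link['href'])


def crawl(start_url, fetch_soup, handle_page, max_attempts=3):
    """Follow "next page" links from start_url until none is left.

    fetch_soup(url) returns a parsed page and handle_page(soup) processes it;
    both are hypothetical callables supplied by the caller.
    A page that raises is retried, but at most max_attempts times, so a
    permanently broken page cannot cause an infinite loop.
    Returns the list of URLs that were processed successfully.
    """
    done = []
    url = start_url
    attempts = 0
    while url is not None:
        attempts += 1
        try:
            soup = fetch_soup(url)
            handle_page(soup)
        except Exception as exc:
            if attempts >= max_attempts:
                raise RuntimeError("giving up on %s" % url) from exc
            continue  # retry the same URL
        done.append(url)
        url = next_page_url(soup)  # None ends the loop
        attempts = 0
    return done
```

With BeautifulSoup, `fetch_soup` could be something like `lambda url: BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")`, and `handle_page` could hold the row-parsing and CSV-writing loop from the question. Note that if `handle_page` fails partway through a page, a retry may write some rows twice.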

Solution 1
0 2015-11-10 01:06:42