Why does my code get stuck in an infinite loop when scraping?

Question

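I am learning how to do basic web scraping with Python 3, and in this example I was trying to scrape all the author names from the website http://quotes.toscrape.com. I tried to write the code, but I don't know the total number of pages on the site. When I run it, though, the editor stops responding. Is there something wrong with the code, or should I just let it run longer?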

import requests
import bs4
i = 0
authors = set()
while True:
    try:
        if i == 0:
            url = "http://quotes.toscrape.com"
        else: 
            url = "http://quotes.toscrape.com/page/{}/".format(i+1)
        
        res = requests.get(url)
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        
        for name in soup.select('.author'):
            authors.add(name.text)
            
        
        i += 1
        
    except:
        break

Answer 1

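I believe the problem is that the site returns a valid response even when a page number has no quotes on it (try http://quotes.toscrape.com/page/23400/ for example). As a result, you will most likely never (or at least not for a very long time) hit an error that triggers your break statement. Instead, break when the page contains the text "No quotes found!". For example: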

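import requests
import bs4

i = 0
authors = set()
while True:
    try:
        # The first page lives at the site root; later pages at /page/N/
        if i == 0:
            url = "http://quotes.toscrape.com"
        else:
            url = "http://quotes.toscrape.com/page/{}/".format(i + 1)

        res = requests.get(url)
        soup = bs4.BeautifulSoup(res.text, 'lxml')

        # Pages past the last one still load, but show this message
        if "No quotes found!" in str(soup):
            break

        for name in soup.select('.author'):
            authors.add(name.text)

        i += 1

    except requests.RequestException:
        # Stop on network errors only, instead of swallowing every exception
        break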
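
A variation on the same idea, offered as a rough sketch rather than a definitive fix: stop as soon as a page yields no author names, and cap the number of pages so the loop cannot run forever even if the "No quotes found!" wording changes. The names `fetch_page`, `parse_authors`, `collect_authors`, and `max_pages` are made up for this example, and the fetch and parse steps are passed in as functions so the loop can be tried without touching the network.

```python
def fetch_page(page):
    """Download one listing page and return its HTML (needs the requests package)."""
    import requests  # imported here so the loop below can be exercised offline

    res = requests.get("http://quotes.toscrape.com/page/{}/".format(page), timeout=10)
    res.raise_for_status()  # turn HTTP error statuses into exceptions
    return res.text


def parse_authors(html):
    """Return the author names found in one page (needs the bs4 package)."""
    import bs4

    soup = bs4.BeautifulSoup(html, "html.parser")
    return [tag.text for tag in soup.select(".author")]


def collect_authors(fetch=fetch_page, parse=parse_authors, max_pages=100):
    """Walk pages 1, 2, 3, ... until a page has no authors or max_pages is reached."""
    authors = set()
    for page in range(1, max_pages + 1):
        names = parse(fetch(page))
        if not names:  # an empty page means we ran past the last one
            break
        authors.update(names)
    return authors
```

Calling `collect_authors()` with the defaults would hit the live site; passing your own `fetch` and `parse` functions lets you check the stopping logic against canned pages first.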

Answer 2

Try:

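import requests
import bs4

i = 0
authors = set()

while True:

    # i = 0 is the site root (page 1); after that, i = 1 maps to /page/2/, and so on,
    # so the first page is not fetched twice
    url = "http://quotes.toscrape.com" if i == 0 else \
         f"http://quotes.toscrape.com/page/{i + 1}/"

    res = requests.get(url)

    # Pages past the last one contain this message instead of quotes
    if 'No quotes found!' not in res.text:
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        for name in soup.select('.author'):
            authors.add(name.text)
        i += 1
    else:
        break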
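
Another way to avoid guessing page numbers is to follow the pager's "Next" link until it disappears. The sketch below assumes the site marks that link up as an `<a>` inside `<li class="next">`; that markup is an assumption about the page, so check it in your browser first. `NextLinkParser` and `next_page_url` are hypothetical names, and the standard library's `html.parser` is used here only so the example runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> inside <li class="next">, if any."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self.in_next = True
        elif tag == "a" and self.in_next and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next = False


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # The href is assumed to be site-relative (e.g. /page/2/), so resolve it
    return urljoin(current_url, parser.next_href)
```

In the scraping loop you would start from the site root, collect the authors from each page, then set `url = next_page_url(url, res.text)` and stop once it returns `None`.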

Solution 1
2 Accepted 2021-06-02 13:34:29

Solution 2
0 2021-06-02 13:33:01
