
Web scraping through pagination with BeautifulSoup

I am scraping data from the Bodybuilding.com website for a course project; my goal is to collect member information. I successfully scraped the details of 20 members on the first page. The problem appears when I move to the second page: as the output shows, indices 21 through 40 repeat the information from indices 1 through 20, and I don't know why.

I assumed that line 28 (the `soup.findAll` call for "bbcHeadMetrics") would update the variable and the information it stores, but it doesn't seem to change. Does this have something to do with the site's structure?

Any help would be appreciated. Thank you.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}

browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0

while pages_remaining:

    if counter == 60:
        pages_remaining = False

    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL

    metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})

    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()

        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()

        if len(individual_details) > 16:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)

        index = index + 1
        counter = counter + 1

    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        rows_remaining = False

You need to update the variables `html` and `soup`:

try:
    # Go to page 2
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    index = 0

    # update html and soup
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")

    time.sleep(30)

except NoSuchElementException:
    pages_remaining = False

I believe you have to do this because the URL does not change and the HTML is being generated dynamically with JavaScript.
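The stale-soup point can be demonstrated without a browser: a BeautifulSoup object is a parse of one fixed string, so clicking "next" in the browser never changes it; you must re-parse the fresh `page_source`. A minimal sketch (the two HTML fragments below are hypothetical stand-ins for successive snapshots of `browser.page_source`):

```python
from bs4 import BeautifulSoup

# Hypothetical snapshots of browser.page_source before and after clicking "next"
page1 = '<div class="bbcHeadMetrics">Age: 25</div>'
page2 = '<div class="bbcHeadMetrics">Age: 31</div>'

soup = BeautifulSoup(page1, "html.parser")
before = soup.find("div", {"class": "bbcHeadMetrics"}).text

# The browser's DOM has moved on to page2, but `soup` still holds page1;
# rebuilding it from the new source is what makes the loop see fresh data:
soup = BeautifulSoup(page2, "html.parser")
after = soup.find("div", {"class": "bbcHeadMetrics"}).text

print(before)  # Age: 25
print(after)   # Age: 31
```

The same reasoning suggests the `time.sleep(30)` should come *before* re-reading `page_source`, so the new page has finished rendering when you parse it.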


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For questions, contact: yoyou2525@163.com.

 