
Web scraping through pagination with BeautifulSoup

I am scraping data from the Bodybuilding.com website for a course project; my goal is to collect member information. I successfully scraped the details of 20 members on the first page. The problem appears when I move to the second page: as the output shows, indices 21 through 40 repeat the information from indices 1 through 20, and I don't know why.

I assumed that line 28 (the `soup.findAll` call for "bbcHeadMetrics") would update the variable and the information it stores, but it doesn't seem to change. Does this have something to do with the site's structure?

Any help would be appreciated. Thank you.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}

browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0

while pages_remaining:

    if counter == 60:
        pages_remaining = False

    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL

    metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})

    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()

        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()

        if len(individual_details) > 16:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)

        index = index + 1
        counter = counter + 1

    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        rows_remaining = False

You need to update the variables `html` and `soup`:

try:
    # Go to page 2
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    index = 0

    # update html and soup
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")

    time.sleep(30)

except NoSuchElementException:
    pages_remaining = False

I believe you have to do this because the URL does not change and the HTML is being generated dynamically with JavaScript.
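The stale-soup point can be demonstrated without a browser: a BeautifulSoup object is a parse of one fixed string, so clicking "next" in the browser never changes it; you must re-parse the fresh `page_source`. A minimal sketch (the two HTML fragments below are hypothetical stand-ins for successive snapshots of `browser.page_source`):

```python
from bs4 import BeautifulSoup

# Hypothetical snapshots of browser.page_source before and after clicking "next"
page1 = '<div class="bbcHeadMetrics">Age: 25</div>'
page2 = '<div class="bbcHeadMetrics">Age: 31</div>'

soup = BeautifulSoup(page1, "html.parser")
before = soup.find("div", {"class": "bbcHeadMetrics"}).text

# The browser's DOM has moved on to page2, but `soup` still holds page1;
# rebuilding it from the new source is what makes the loop see fresh data:
soup = BeautifulSoup(page2, "html.parser")
after = soup.find("div", {"class": "bbcHeadMetrics"}).text

print(before)  # Age: 25
print(after)   # Age: 31
```

The same reasoning suggests the `time.sleep(30)` should come *before* re-reading `page_source`, so the new page has finished rendering when you parse it.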


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For questions, contact: yoyou2525@163.com.

 