
Scrape multiple web pages with unchanging URL by clicking next

I am trying to scrape multiple pages from https://ethnicelebs.com/all-celebs, but the URL stays the same for every page. I want to scrape the ethnicity information of every celebrity from all of the pages (shown when a listed name is clicked).

After navigating, I use the following code to scrape the ethnicity information, but it keeps scraping the first page:

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)
while True:

    page = requests.post(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        request_href = requests.get(href['href'])
        soup2 = BeautifulSoup(request_href.content)
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)

    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)

(Thanks @Sureshmani!)

How can I scrape the current page while the script keeps navigating forward, instead of going back to the first page? Thanks!

I can navigate to every page using this approach:

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)
i = 1
while i < 52:
    # Each page of results sits in a div whose data-id contains 'pt-cv-page-<i>'.
    myxpath = "//div[contains(@data-id,'pt-cv-page-" + str(i) + "')]//h6/a"
    print(myxpath)
    CelebLinks = driver.find_elements_by_xpath(myxpath)
    print('Total celebs displayed on the current page:', len(CelebLinks))
    for link in CelebLinks:
        print(str(link.text).encode('UTF-8', 'ignore'))
        print('Link URL:', link.get_attribute('href'))
        # The detail pages have their own URLs, so requests can fetch them directly.
        request_href = requests.get(link.get_attribute('href'))
        soup = BeautifulSoup(request_href.content, 'html.parser')
        for each in soup.find_all('strong')[:-1]:
            print(str(each.text).encode('UTF-8', 'ignore'))
    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    time.sleep(5)
    i += 1

You will need the following imports:

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import requests
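
As for why the original loop in the question keeps returning the first page: requests.post(url) issues its own HTTP request to the unchanged listing URL, independently of the browser session that clicked Next, so it always gets the same page back. Below is a minimal, untested sketch of an alternative that keeps the shape of the original loop but parses driver.page_source after each click, so BeautifulSoup always sees whatever page the browser is currently displaying; the pt-cv-page-N containers and the 52-page count are reused from the answer above.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import requests
import time

driver = webdriver.Chrome()
driver.get('https://ethnicelebs.com/all-celebs')
time.sleep(5)

page_no = 1
while page_no < 52:  # 52 pages, as in the answer above
    # Parse the HTML currently rendered by the browser, not a fresh request to the same URL.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Substring match on data-id, mirroring the contains() XPath used in the answer.
    selector = 'div[data-id*="pt-cv-page-{}"] h6 a[href]'.format(page_no)
    for link in soup.select(selector):
        print('Found the URL: {}'.format(link['href']))
        detail = requests.get(link['href'])  # detail pages have their own, distinct URLs
        detail_soup = BeautifulSoup(detail.content, 'html.parser')
        for each in detail_soup.find_all('strong')[:-1]:
            print(each.text)

    next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(next_button)).click()
    time.sleep(5)
    page_no += 1

Since driver.page_source reflects whatever the browser has rendered after the click, no separate requests call to the listing URL is needed; requests is only used for the individual celebrity pages, which have their own URLs.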
