
Scraping contents of multiple web pages of a website using BeautifulSoup and Selenium

The website I want to scrape is:

http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061

I want to get the last page number of the above link in order to proceed, which was 499 at the time of taking the screenshot.

(Screenshot: the last page number I currently get as output.)

My code:

   from bs4 import BeautifulSoup 
   from urllib.request import urlopen as uReq
   from selenium import webdriver;import time
   from selenium.webdriver.common.by import By
   from selenium.webdriver.support.ui import WebDriverWait
   from selenium.webdriver.support import expected_conditions as EC
   from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

   firefox_capabilities = DesiredCapabilities.FIREFOX
   firefox_capabilities['marionette'] = True
   firefox_capabilities['binary'] = '/etc/firefox'

   driver = webdriver.Firefox(capabilities=firefox_capabilities)
   url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"

   driver.get(url)
   wait = WebDriverWait(driver, 10)
   soup=BeautifulSoup(driver.page_source,"lxml")
   containers = soup.findAll("ul",{"class":"pages table"})
   containers[0] = soup.findAll("li")
   li_len = len(containers[0])
   for item in soup.find("ul",{"class":"pages table"}) :
       li_text = item.select("li")[li_len].text
       print("li_text : {}\n".format(li_text))
   driver.quit()

I need help figuring out the error in my code for getting the last page number. I would also be grateful if someone could provide an alternate solution and suggest ways to achieve my goal.

If you want to get the last page number of the above link for proceeding, which is 499, you can use either Selenium or BeautifulSoup as follows:


Selenium:

from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
# Scroll the pagination section into view so the pager is rendered on screen.
element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]")
driver.execute_script("return arguments[0].scrollIntoView(true);", element)
# The last <li> in the pager list holds the last page number.
print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML"))
driver.quit()

Console Output:

499
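As a variant of the Selenium approach, an explicit wait can replace the manual scroll. This is a minimal, illustrative sketch assuming the same XPath as above; the 10-second timeout is an arbitrary choice:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
# Wait (up to 10 seconds, an arbitrary choice) for the last pager item to be present
# in the DOM, instead of scrolling the pagination block into view first.
last_page = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH,
        "//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a")))
print(last_page.get_attribute("innerHTML"))
driver.quit()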

BeautifulSoup:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
# Find the pager list and walk to its final <li>, which holds the last page number.
container = page_soup.find("ul",{"class":"pages table"})
all_li = container.findAll("li")
last_div = None
for last_div in all_li:
    pass  # after the loop, last_div is the final <li>
if last_div:
    content = last_div.getText()
    print(content)

Console Output:

499
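Since the original goal is to scrape the contents of multiple pages, the extracted last page number can then drive a loop over every page. The sketch below is only an illustration: the "-page-N" URL suffix is an assumption about how mouthshut.com paginates its review pages and must be verified against the real URLs before use:

from bs4 import BeautifulSoup
from urllib.request import urlopen

base_url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"

# Read the last page number from the pager on the first page.
first_soup = BeautifulSoup(urlopen(base_url).read(), "html.parser")
last_page = int(first_soup.find("ul", {"class": "pages table"}).findAll("li")[-1].getText())

for page in range(1, last_page + 1):
    # ASSUMPTION: pages after the first use a "-page-N" suffix; adjust if the site differs.
    page_url = base_url if page == 1 else "{}-page-{}".format(base_url, page)
    page_soup = BeautifulSoup(urlopen(page_url).read(), "html.parser")
    # ... extract the desired review content from page_soup here ...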
