Scraping contents of multiple web pages of a website using BeautifulSoup and Selenium
The website I want to scrape is:
http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061
I want to get the last page number of the above link so I can proceed; it was 499 when I took the screenshot.
My code:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['binary'] = '/etc/firefox'
driver = webdriver.Firefox(capabilities=firefox_capabilities)
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
wait = WebDriverWait(driver, 10)
soup=BeautifulSoup(driver.page_source,"lxml")
containers = soup.findAll("ul",{"class":"pages table"})
containers[0] = soup.findAll("li")
li_len = len(containers[0])
for item in soup.find("ul",{"class":"pages table"}) :
    li_text = item.select("li")[li_len].text
    print("li_text : {}\n".format(li_text))
driver.quit()
I need help figuring out the error in my code for getting the last page number. I would also be grateful for an alternate solution and any suggestions on how to achieve my intention.
If you want to get the last page number of the above link for proceeding, which is 499, you can use either Selenium or BeautifulSoup as follows.
Selenium:
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
# scroll the pagination block into view so the page numbers are rendered
# (note: in Selenium 4 the find_element_by_* methods were removed in favour
# of driver.find_element(By.XPATH, ...))
element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]")
driver.execute_script("return arguments[0].scrollIntoView(true);", element)
# li[last()] selects the final page-number entry in the pagination list
print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML"))
driver.quit()
Console Output:
499
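The `li[last()]` idea behind the XPath above can be checked offline without launching a browser. Below is a minimal sketch using BeautifulSoup on a hypothetical snippet that imitates the page's pagination markup (the sample HTML is an assumption for illustration, not the site's actual source):

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the site's "pages table" pagination list.
sample_html = """
<ul class="pages table">
  <li><a>1</a></li>
  <li><a>2</a></li>
  <li><a>499</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# The last <li> in the pagination list carries the last page number.
lis = soup.find("ul", {"class": "pages table"}).find_all("li")
last_page = lis[-1].get_text(strip=True)
print(last_page)  # → 499
```

This also shows the two bugs in the question's snippet: Python lists are 0-indexed, so indexing with `len(...)` overruns the list by one, and iterating over the result of `soup.find(...)` walks the tag's children (including bare strings) rather than a list of `li` tags.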
BeautifulSoup:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.find("ul",{"class":"pages table"})
all_li = container.findAll("li")
# the last <li> in the pagination list holds the last page number
if all_li:
    content = all_li[-1].getText()
    print(content)
Console Output:
499
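If installing a third-party parser is a concern, the same extraction can be sketched with only the standard library's `html.parser`. `LastPageParser` below is a hypothetical helper written for this illustration, and the sample markup is again an assumption mimicking the page, not its actual source:

```python
from html.parser import HTMLParser

class LastPageParser(HTMLParser):
    """Collect the text of every <li> inside <ul class="pages table">."""

    def __init__(self):
        super().__init__()
        self.in_pages_ul = False
        self.in_li = False
        self.pages = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and dict(attrs).get("class") == "pages table":
            self.in_pages_ul = True
        elif tag == "li" and self.in_pages_ul:
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_pages_ul = False
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # record non-blank text found inside a pagination <li>
        if self.in_li and data.strip():
            self.pages.append(data.strip())

# Hypothetical snippet imitating the pagination markup.
sample = '<ul class="pages table"><li><a>1</a></li><li><a>2</a></li><li><a>499</a></li></ul>'
parser = LastPageParser()
parser.feed(sample)
print(parser.pages[-1])  # → 499
```

The trade-off is verbosity: BeautifulSoup expresses the same thing in two lines, but this version has no dependencies beyond the standard library.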