How to loop through pages using Beautiful Soup 4, Python and Selenium?

Question

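Although I have some experience with Selenium, this is my first time using Python and Beautiful Soup. I am trying to scrape all the affiliation numbers from a website ("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx").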

The problem is that they are spread over multiple pages (20 results per page, 21,000+ results in total).

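So I would like to scrape them in some kind of loop that keeps clicking the Next page button. The catch is that the URL of the page does not change, so there is no URL pattern to iterate over.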

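For this I first tried the Google Sheets IMPORTHTML / IMPORTXML approach, but because of the scale of the problem it just hangs. Next I tried Python and started reading about scraping with Python (this is my first time doing this :)). Someone on this platform suggested a method: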

(Python Requests / BeautifulSoup access to pagination)

I tried to do the same, but with little to no success.

Also, to get any results, we first have to query the search bar with the keyword "a" and then click Search. Only then does the website show results.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe",options=options)

driver.get("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx")
#click on the radio btn
driver.find_element(By.ID,'optlist_0').click()

time.sleep(2)

# Search with the letter "a" and click the Search button
driver.find_element(By.ID,'keytext').send_keys("a")
driver.find_element(By.ID,'search').click()

time.sleep(2)

next_button = driver.find_element(By.ID, "Button1")
data = []
try:
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'T1'})  # main table
        table_body = table.find('tbody')  # get inside the body
        rows = table_body.find_all('tr')  # look for all table rows
        for row in rows:
            cols = row.find_all('td')  # in every table row, look for table data
            for row2 in cols:
                # table -> tbody -> tr -> td -> <b> --> exit loop (only the first tr is our required data, print this)

The final result I expect is a list of all affiliation numbers across the multiple pages.

Answer 1

Add a small piece of code inside the while loop:

next_button = 1 #Initialise the variable for the first instance of while loop

while next_button:
    #First scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
    #Now locate the button & click on it
    next_button = driver.find_element(By.ID,'Button1')
    next_button.click()
    ###
    ###Beautiful Soup Code : Fetch the page source now & do your thing###
    ###
    #Adjust the timing as per your requirement
    time.sleep(2)
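
As written, the loop above clicks before it parses (so the first page of results is never read) and it has no exit condition: `find_element` raises `NoSuchElementException` rather than returning something falsy. A hedged sketch of one way to restructure it follows. It assumes the Next button disappears or becomes disabled on the last page, which has not been verified against the site. The names `scrape_all_pages`, `parse_page`, `pause` and `max_pages` are made up for this illustration.

```python
import time


def scrape_all_pages(driver, parse_page, button_id="Button1", pause=2.0, max_pages=2000):
    """Collect parse_page(html) results from every page reachable via the Next button.

    driver     -- a Selenium WebDriver already showing the first page of results
    parse_page -- hypothetical callable: page source (str) -> list of values
    """
    results = []
    for _ in range(max_pages):  # hard upper bound so the loop cannot run forever
        # Parse the page that is currently shown before moving on,
        # so the first page of results is not skipped.
        results.extend(parse_page(driver.page_source))

        # Scroll down so the button is not covered by the footer.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # "id" is the string value of By.ID. find_elements returns an empty
        # list instead of raising when nothing matches.
        buttons = driver.find_elements("id", button_id)
        if not buttons or not buttons[0].is_enabled():
            break  # ASSUMPTION: a missing or disabled button means last page

        buttons[0].click()
        time.sleep(pause)  # crude fixed wait for the next page to load
    return results
```

A call might look like `numbers = scrape_all_pages(driver, parse_page)`, where `parse_page` is whatever function turns `driver.page_source` into a list of affiliation numbers. The fixed `time.sleep` is kept from the original answer for simplicity; an explicit wait would be more robust.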

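Note that scrolling to the bottom of the page is important; otherwise an error is raised claiming that the 'Button1' element is hidden under the footer. With the script at the start of the loop, the browser moves down to the bottom of the page, where the 'Button1' element is clearly visible. Then locate the element, perform the click, and let Beautiful Soup take over.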
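
For the Beautiful Soup step, a minimal sketch might look like the following. It assumes, based only on the comments in the question, that the results sit in the table with id `T1` and that the affiliation number is the first cell of each row; the real page may nest its tables differently. `first_cells` and `SAMPLE` are illustrative names, and the sample HTML is made up.

```python
from bs4 import BeautifulSoup


def first_cells(page_source, table_id="T1"):
    """Return the stripped text of the first <td> in every row of the table."""
    soup = BeautifulSoup(page_source, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:
        return []  # no results table on this page
    values = []
    # find_all is recursive, so rows of nested tables (if any) are included too.
    for row in table.find_all("tr"):
        cell = row.find("td")  # first cell of the row; header rows with only <th> are skipped
        if cell is not None:
            values.append(cell.get_text(strip=True))
    return values


# Made-up HTML standing in for driver.page_source.
SAMPLE = """
<table id="T1"><tbody>
<tr><td><b>1234567</b></td><td>School A</td></tr>
<tr><td><b>7654321</b></td><td>School B</td></tr>
</tbody></table>
"""
print(first_cells(SAMPLE))  # ['1234567', '7654321']
```

Inside the paging loop, the same function would be called on `driver.page_source` after each page has loaded.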

Answer 1
0 2019-06-24 00:34:45
