
How to loop pages using Beautiful Soup 4, Python and Selenium?

I'm fairly new to Python and this is my first time using Beautiful Soup, though I have some experience with Selenium. I am trying to scrape a website ("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx") for all the affiliation numbers.

The problem is they are spread across multiple pages (20 results per page, 21,000+ results in total).

So I wish to scrape these in some kind of loop that can iterate over the next-page button; the problem is that the URL of the web page does not change when paging, so there is no URL pattern to follow.

Okay, so for this I have tried the Google Sheets ImportHTML / ImportXML method, but due to the large scale of the problem it just hangs. Next I tried Python and started reading about scraping with it (I'm doing this for the first time :) ). Someone on this platform suggested a method:

(Python Requests/BeautifulSoup access to pagination)

I am trying to do the same, but with little to no success.

Also, to fetch the results we first have to query the search bar with the keyword "a" and then click Search. Only then does the website show results.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by  import By
import time

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe",options=options)

driver.get("http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx")
#click on the radio btn
driver.find_element(By.ID,'optlist_0').click()

time.sleep(2)

# Search the query with letter A And Click Search btn
driver.find_element(By.ID,'keytext').send_keys("a")
driver.find_element(By.ID,'search').click()

time.sleep(2)

next_button = driver.find_element(By.ID, "Button1")
data = []
try:
    while (next_button):    
        soup = BeautifulSoup(driver.page_source,'html.parser')
        table = soup.find('table',{'id':'T1'}) #Main Table
        table_body = table.find('tbody') #get inside the body
        rows = table_body.find_all('tr') #look for all tablerow
        for row in rows:
            cols = row.find_all('td')  # in every table row, look for the table data cells
            for row2 in cols:
                # table -> tbody -> tr -> td -> <b> --> exit loop (only the first tr holds the required data, print it)

The final outcome I expect is a list of all affiliation numbers across the multiple pages.

A minor addition to the code within your while loop:

next_button = 1 #Initialise the variable for the first instance of while loop

while next_button:
    #First scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
    #Now locate the button & click on it
    next_button = driver.find_element(By.ID,'Button1')
    next_button.click()
    ###
    ###Beautiful Soup Code : Fetch the page source now & do your thing###
    ###
    #Adjust the timing as per your requirement
    time.sleep(2)

Note that scrolling to the bottom of the page is important; otherwise an error will pop up claiming the 'Button1' element is hidden under the footer. With the scroll script at the beginning of the loop, the browser moves down to the bottom of the page, where it can see the 'Button1' element clearly. Now locate the element, perform the click action, and then let your Beautiful Soup code take over.
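
Putting the two pieces together, a rough sketch of the full loop might look like the following. It assumes the driver is already on the first results page (i.e. the search code from the question has run), that the results table keeps the id 'T1', that each affiliation number sits in a <b> tag inside the first cell of a row, and that 'Button1' can no longer be found once the last page is reached; any of these assumptions may need adjusting against the real page.

from selenium.common.exceptions import NoSuchElementException

data = []
while True:
    # Parse whatever page is currently loaded before moving on,
    # so the first page is not skipped.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find('table', {'id': 'T1'})          # main results table
    if table is not None:
        for row in table.find_all('tr'):
            cell = row.find('td')                     # first cell of the row
            bold = cell.find('b') if cell else None   # affiliation number assumed to be inside <b>
            if bold:
                data.append(bold.get_text(strip=True))

    # Scroll to the bottom so 'Button1' is not hidden behind the footer,
    # then try to click it; stop when it can no longer be found.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        driver.find_element(By.ID, 'Button1').click()
    except NoSuchElementException:
        break
    time.sleep(2)   # give the next page time to load; adjust as needed

print(len(data), "affiliation numbers collected")

Parsing before clicking means the first page's rows are captured as well. If the next-page button turns out to still be present but disabled on the last page, the exit condition would need to change (for example, by comparing the page source before and after the click).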
