簡體   English   中英

機械化和Python,單擊href =“javascript:void(0);”鏈接並獲取響應

[英]Mechanize and Python, clicking href=“javascript:void(0);” links and getting the response back

我需要從頁面中刪除一些數據,我填寫表單(已經使用mechanize進行了此操作)。 問題是,頁面在很多頁面上返回數據,而且從這些頁面獲取數據我遇到了麻煩。

從第一個結果頁面獲取它們沒有問題,因為它在搜索后已經顯示 - 我只需提交表單並獲得響應。

我分析了結果頁面的源代碼,它似乎使用了Java Script,RichFaces(JSF的一些lib和ajax,但我可能是錯的,因為我不是網絡專家)。

但是,我設法弄清楚如何到達剩余的結果頁面。 我需要點擊這種形式的鏈接( href="javascript:void(0);" ,完整代碼如下):

<td class="pageNumber"><span class="rf-ds " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"><span class="rf-ds-nmb-btn rf-ds-act " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1">1</span><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2">2</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3">3</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4">4</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5">5</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6">6</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7">7</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8">8</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9">9</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10">10</a><a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next">»</a><a class="rf-ds-btn rf-ds-btn-last" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l">»»»»</a>

<script type="text/javascript">new RichFaces.ui.DataScroller("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",function(event,element,data){RichFaces.ajax("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",event,{"parameters":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page":data.page} ,"incId":"1"} )},{"digitals":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9":"9","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8":"8","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7":"7","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6":"6","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5":"5","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4":"4","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3":"3","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1":"1","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10":"10","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2":"2"} ,"buttons":{"right":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next":"next","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l":"last"} } ,"currentPage":1} )</script></span></td>
<td class="pageExport"><script type="text/javascript" src="/opi/javax.faces.resource/download.js?ln=js/component&amp;b="></script><script type="text/javascript">

所以我想問一下是否有辦法點擊所有鏈接並使用mechanize獲取所有頁面(注意,在»符號之后有更多頁面可用)? 我用網絡知識詢問總傻瓜的答案:)

首先,我仍然堅持使用selenium,因為這是一個非常“重量級”的網站。 請注意,如果需要,您可以使用無頭瀏覽器( PhantomJS虛擬顯示器 )。

這里的想法是每頁分頁100行,點擊“>>”鏈接,直到它不在頁面上,這意味着我們已經到了最后一頁,沒有更多的結果要處理。 為了使解決方案可靠,我們需要使用顯式等待 :每次進入下一頁時 - 等待加載微調器的不可見性。

工作實施:

# -*- coding: utf-8 -*-
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.maximize_window()

driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')
wait = WebDriverWait(driver, 30)

# paginate by 100
select = Select(driver.find_element_by_id("drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
select.select_by_visible_text("100")

while True:
    # wait until there is no loading spinner
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

    current_page = driver.find_element_by_class_name("rf-ds-act").text
    print("Current page: %d" % current_page)

    # TODO: collect the results

    # proceed to the next page
    try:
        next_page = driver.find_element_by_link_text(u"»")
        next_page.click()
    except NoSuchElementException:
        break

這適用於我:似乎所有的HTML都在page可用

import time    
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')

next_id = 'drhPageForm:drhPageTable:j_idt211:j_idt233_ds_next'

pages = []
it = 0
while it < 1795:
    time.sleep(1)
    it += 1
    bad = True
    while bad:
        try:
            driver.find_element_by_id(next_id).click()
            bad = False 
        except:
            print('retry')

    page = driver.page_source

    pages.append(page)

您可以只查詢您想要的內容,而不是首先收集和存儲所有html,但您需要lxmlBeautifulSoup

編輯:運行后確實我注意到我們犯了一個錯誤。 只是捕獲異常並重試很簡單。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM