Using Selenium to display 'next' search results via JavaScript __doPostBack links
In the search results of the JobQuest site ( http://jobquest.detma.org/JobQuest/Training.aspx ), I would like to use Selenium to click the "next" link so that the next paginated results table of 20 records loads. I can only scrape as far as the first 20 results. Here are the steps that got me that far:
Step 1: I load the opening page.
import requests, re
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome('../chromedriver')
url ='http://jobquest.detma.org/JobQuest/Training.aspx'
browser.get(url)
Step 2: I find the search button and click it to request a search with no search criteria. After this code runs, the search results page loads with the first 20 records in a table:
submit_button = browser.find_element_by_id('ctl00_ctl00_bodyMainBase_bodyMain_btnSubmit')
submit_button.click()
Step 3: Now on the search results page, I create some soup and use "find_all" to get the correct rows:
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr",{"class":"gvRow"})
At this point, I can fetch my data (job IDs) from the first page of results using the rows object like this:
id_list = []
for row in rows:
    temp = str(row.find("a"))[33:40]
    id_list.append(temp)
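As an aside, the fixed slice `[33:40]` is fragile: it breaks if the anchor's markup shifts by even one character. A sketch of a sturdier alternative is to pull a 7-digit run out of the anchor's string form with a regex (the sample markup below is hypothetical, and the 7-digit assumption simply mirrors what the slice implies):

```python
import re

def extract_id(anchor_html):
    """Return the first 7-digit run in the anchor's HTML, or None."""
    match = re.search(r"\d{7}", anchor_html)
    return match.group(0) if match else None

# Hypothetical anchor markup standing in for str(row.find("a")):
sample = '<a href="javascript:__doPostBack(...)">1234567</a>'
print(extract_id(sample))  # 1234567
```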
QUESTION - Step 4, help!! To reload the table with the next 20 results, I have to click the "next" link on the results page. I used Chrome to inspect it and got these details:
<a href="javascript:__doPostBack('ctl00$ctl00$bodyMainBase$bodyMain$egvResults$ctl01$ctl08','')">Next</a>
I need code to programmatically click on Next and remake the soup with the next 20 records. I expect that if I can figure this out, I can figure out how to loop the code to get all ~1515 IDs in the database.
UPDATE: The line that worked for me, suggested in the answer, is:
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[href*=ctl08]'))).click()
Thank you, this was very useful.
You can use an attribute = value selector to target the href. In this case I match the substring at the end via the contains (*) operator.
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[href*=ctl08]'))).click()
I add in a wait-for-clickable condition as a precautionary measure. You could probably remove that.
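Putting the pieces together, that click can be repeated until the wait times out on the last page. The sketch below keeps the Selenium calls behind two callables so the control flow itself can be checked without a browser; `scrape_all_ids`, `get_page_ids`, and `click_next` are illustrative names, not site or library APIs:

```python
def scrape_all_ids(get_page_ids, click_next):
    """Collect ids page by page until click_next reports no next page."""
    ids = []
    while True:
        ids.extend(get_page_ids())
        if not click_next():  # False on the last page
            return ids

# With Selenium, get_page_ids would re-make the soup from browser.page_source,
# and click_next would wrap the waited click, returning False when the wait
# raises TimeoutException.

# Quick check with stubbed pages standing in for the live site:
pages = [["1111111", "2222222"], ["3333333"]]
state = {"page": 0}

def fake_ids():
    return pages[state["page"]]

def fake_next():
    if state["page"] + 1 < len(pages):
        state["page"] += 1
        return True
    return False

print(scrape_all_ids(fake_ids, fake_next))  # ['1111111', '2222222', '3333333']
```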
Additional imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Without the wait condition:
browser.find_element_by_css_selector('[href*=ctl08]').click()
Another way:
Alternatively, you could initially set the page results count to 100 (the maximum) and then loop through the page dropdown to load each new page of results (then you don't need to worry about how many pages there are):
import requests, re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
url ='http://jobquest.detma.org/JobQuest/Training.aspx'
browser.get(url)
submit_button = browser.find_element_by_id('ctl00_ctl00_bodyMainBase_bodyMain_btnSubmit')
submit_button.click()
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[value="100"]'))).click()
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr",{"class":"gvRow"})
id_list = []
for row in rows:
    temp = str(row.find("a"))[33:40]
    id_list.append(temp)
elems = browser.find_elements_by_css_selector('#ctl00_ctl00_bodyMainBase_bodyMain_egvResults select option')
# Each page number appears twice in the dropdown (top and bottom of the grid),
# so the page count is len(elems) // 2. Page 1 is already loaded, so start
# from page 2 and include the last page.
i = 2
while i <= len(elems) // 2:
    browser.find_element_by_css_selector('#ctl00_ctl00_bodyMainBase_bodyMain_egvResults select option[value="' + str(i) + '"]').click()
    # do stuff with the new page
    i += 1
You decide what to do with extracting the row info from each page. This was to give you an easy framework for looping over all the pages.
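One practical note for either approach: if a loop ever revisits a page it has already scraped (easy to do with off-by-one bounds), duplicate IDs creep into the list. A small order-preserving de-duplication pass cleans that up; `dedupe` is just an illustrative helper name:

```python
def dedupe(ids):
    """Drop repeated ids while preserving first-seen order."""
    seen = set()
    out = []
    for i in ids:
        if i not in seen:
            seen.add(i)
            out.append(i)
    return out

print(dedupe(["1111111", "2222222", "1111111", "3333333"]))  # ['1111111', '2222222', '3333333']
```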