
Python Selenium pull href info out of find_elements_by_partial_link_text

I'm working on pulling some data from a website. I can successfully surf to the page that lists all the updated data from the day before, but now I need to iterate through all the links and save the source of each page to a file.

Once it's in a file, I want to use BeautifulSoup to better arrange the data so I can parse through it.

#learn.py
from BeautifulSoup import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url1 = 'https://odyssey.tarrantcounty.com/default.aspx'
date = '07/31/2014'
option_by_date = "6"
driver = webdriver.Firefox()
driver.get(url1)
continue_link = driver.find_element_by_partial_link_text('Case')

#follow link
continue_link.click()

driver.find_element_by_xpath("//select[@name='SearchBy']/option[text()='Date Filed']").click()
#fill in dates in form
from_date = driver.find_element_by_id("DateFiledOnAfter")
from_date.send_keys(date)
to_date = driver.find_element_by_id("DateFiledOnBefore")
to_date.send_keys(date)

submit_button = driver.find_element_by_id('SearchSubmit')
submit_button.click()

link_list = driver.find_elements_by_partial_link_text('2014')

link_list should be a list of the applicable links, but I'm not sure where to go from there.
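Each item in link_list is a web element, so its link target can be read with get_attribute('href'). A minimal sketch of the missing step, continuing from the code above and saving each page's source for BeautifulSoup later (the output file names are an assumption for illustration):

import io

# pull the href attribute out of each element returned by
# find_elements_by_partial_link_text()
hrefs = [element.get_attribute('href') for element in link_list]

# visit each link and save the page source to its own file
for i, href in enumerate(hrefs):
    driver.get(href)
    with io.open('case_%d.html' % i, 'w', encoding='utf-8') as f:
        f.write(driver.page_source)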

Get all links whose href attribute starts with CaseDetail.aspx?CaseID=; find_elements_by_xpath() would help here:

# get the list of links
links = [link.get_attribute('href') 
         for link in driver.find_elements_by_xpath('//td/a[starts-with(@href, "CaseDetail.aspx?CaseID=")]')]
for link in links:
    # follow the link
    driver.get(link)

    # parse the data
    print driver.find_element_by_class_name('ssCaseDetailCaseNbr').text

Prints:

Case No. 2014-PR01986-2
Case No. 2014-PR01988-1
Case No. 2014-PR01989-1
...

Note that you don't need to save the pages and parse them via BeautifulSoup. Selenium itself is quite powerful at navigating web pages and extracting data from them.
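For instance, staying entirely within Selenium, the loop above could collect each case number into a list instead of printing it; a minimal sketch reusing the links list from the code above:

# collect the case numbers with Selenium alone, no HTML parser involved
case_numbers = []
for link in links:
    driver.get(link)
    case_numbers.append(
        driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)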

You can fetch web elements using their tag name. If you want to fetch all the links on a web page, I would use find_elements_by_tag_name().

links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
# skip anchors without an href (get_attribute() returns None for those)
link_urls = [url for url in link_urls if url]
source_dict = dict()
for url in link_urls:
    driver.get(url)
    source = driver.page_source #this will give you page source
    source_dict[url] = source

#source_dict dictionary will contain the source code you wanted for each url with the url as the key.
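If you still want the BeautifulSoup step the question describes, the sources collected in source_dict can be fed to it directly; a minimal sketch using the BeautifulSoup 3 import from the question (with bs4, the import would be from bs4 import BeautifulSoup instead):

from BeautifulSoup import BeautifulSoup

# parse each saved page source; the <title> tag is just an example target
for url, source in source_dict.items():
    soup = BeautifulSoup(source)
    print url, soup.title.string if soup.title else '(no title)'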
