
Python Selenium pull href info out of find_elements_by_partial_link_text

I'm working on pulling some data from a website. I can successfully surf to the page that lists all the updated data from the day before, but now I need to iterate through all the links and save the source of each page to a file.

Once it's in a file, I want to use BeautifulSoup to better arrange the data so I can parse through it.

#learn.py
from BeautifulSoup import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url1 = 'https://odyssey.tarrantcounty.com/default.aspx'
date = '07/31/2014'
option_by_date = "6"
driver = webdriver.Firefox()
driver.get(url1)
continue_link = driver.find_element_by_partial_link_text('Case')

#follow link
continue_link.click()

driver.find_element_by_xpath("//select[@name='SearchBy']/option[text()='Date Filed']").click()
#fill in dates in form
from_date = driver.find_element_by_id("DateFiledOnAfter")
from_date.send_keys(date)
to_date = driver.find_element_by_id("DateFiledOnBefore")
to_date.send_keys(date)

submit_button = driver.find_element_by_id('SearchSubmit')
submit_button.click()

link_list = driver.find_elements_by_partial_link_text('2014')

link_list should be a list of the applicable links, but I'm not sure where to go from there.
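Each item in link_list is a web element, so its link target can be read with get_attribute('href'). A minimal sketch of the missing step, continuing from the code above and saving each page's source for BeautifulSoup later (the output file names are an assumption for illustration):

import io

# pull the href attribute out of each element returned by
# find_elements_by_partial_link_text()
hrefs = [element.get_attribute('href') for element in link_list]

# visit each link and save the page source to its own file
for i, href in enumerate(hrefs):
    driver.get(href)
    with io.open('case_%d.html' % i, 'w', encoding='utf-8') as f:
        f.write(driver.page_source)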

Get all links whose href attribute starts with CaseDetail.aspx?CaseID=; find_elements_by_xpath() would help here:

# get the list of links
links = [link.get_attribute('href') 
         for link in driver.find_elements_by_xpath('//td/a[starts-with(@href, "CaseDetail.aspx?CaseID=")]')]
for link in links:
    # follow the link
    driver.get(link)

    # parse the data
    print driver.find_element_by_class_name('ssCaseDetailCaseNbr').text

Prints:

Case No. 2014-PR01986-2
Case No. 2014-PR01988-1
Case No. 2014-PR01989-1
...

Note that you don't need to save the pages and parse them via BeautifulSoup. Selenium itself is quite powerful at navigating web pages and extracting data from them.
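For instance, staying entirely within Selenium, the loop above could collect each case number into a list instead of printing it; a minimal sketch reusing the links list from the code above:

# collect the case numbers with Selenium alone, no HTML parser involved
case_numbers = []
for link in links:
    driver.get(link)
    case_numbers.append(
        driver.find_element_by_class_name('ssCaseDetailCaseNbr').text)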

You can fetch web elements using their tag name. If you want to fetch all the links on a web page, I would use find_elements_by_tag_name().

links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
# skip anchors without an href (get_attribute() returns None for those)
link_urls = [url for url in link_urls if url]
source_dict = dict()
for url in link_urls:
    driver.get(url)
    source = driver.page_source #this will give you page source
    source_dict[url] = source

#source_dict dictionary will contain the source code you wanted for each url with the url as the key.
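If you still want the BeautifulSoup step the question describes, the sources collected in source_dict can be fed to it directly; a minimal sketch using the BeautifulSoup 3 import from the question (with bs4, the import would be from bs4 import BeautifulSoup instead):

from BeautifulSoup import BeautifulSoup

# parse each saved page source; the <title> tag is just an example target
for url, source in source_dict.items():
    soup = BeautifulSoup(source)
    print url, soup.title.string if soup.title else '(no title)'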
