Selenium, not sure how to retrieve links from elements found through xpath

I'm working with Selenium and Python to retrieve the links to all of the patents returned by a Google Patents search. This is the code I currently have:

import urllib.request
from selenium import webdriver

urlpage = 'https://patents.google.com/?assignee=pfizer&after=priority:20010602&type=PATENT' 
print(urlpage)
driver = webdriver.Firefox()

driver.get(urlpage)

results = driver.find_elements_by_xpath("//*[@id='link']")

data = []
for result in results:
    product_name = result.text
    link = result.find_element_by_tag_name('a')
    product_link = link.get_attribute("href")
    data.append({"product" : product_name, "link" : product_link})

For example, I'm trying to retrieve

/patent/AU2016304408B2/en?assignee=Pfizer&after=priority:20110602&type=PATENT&num=100

from the first link on the page. However, I'm not sure if I'm using the right XPath to find all of the links.

I'm also getting an error in the loop that appends the links to the list: "Message: Unable to locate element: a". Printing 'results' doesn't tell me much, so I'm having trouble debugging it myself.

There are several challenges with this page.

  1. Not all of the #link elements are visible or are what you are looking for.
  2. The href of the A tag doesn't actually get filled until you hover the link. (yuck)
  3. There's a lot of invalid HTML on the page... multiple elements with the same ID, etc.

...among other challenges...

I found a locator that narrows the results down to just the visible elements that contain the articles you are looking for. Getting the name of the article is pretty straightforward. Getting the URL of the A tag took some work... I basically built it myself from the domain name, your query parameters, and the patent number, which I read from a different element.

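from selenium import webdriver
from selenium.webdriver.common.by import By

url_domain = 'https://patents.google.com/'
url_query = '?assignee=pfizer&after=priority:20010602&type=PATENT'
print(url_domain + url_query)
driver = webdriver.Firefox()

driver.get(url_domain + url_query)

# find_elements(By...) replaces the find_elements_by_* helpers, which were removed in Selenium 4.
# Each visible search result is an <article> under #main.
results = driver.find_elements(By.CSS_SELECTOR, "#main article")

data = []
for result in results:
    # The title text is in the element with id "link" inside the article.
    product_name = result.find_element(By.ID, 'link').text
    # The href is empty until hover, so read the patent path from data-result instead.
    url_patent = result.find_element(By.CSS_SELECTOR, 'state-modifier').get_attribute('data-result')
    product_link = url_domain + url_patent + url_query
    data.append({"product": product_name, "link": product_link})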
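
The string concatenation in the loop above only works if the pieces line up exactly (one slash between domain and path, one '?' before the query). As a rough sketch, the URL assembly could be moved into a small helper. The name build_patent_url is hypothetical, and the code assumes that data-result holds a relative path such as patent/AU2016304408B2/en, which I have not verified for every result.

```python
from urllib.parse import urljoin


def build_patent_url(domain, patent_path, query):
    """Join a domain, a relative patent path and a query string into one URL.

    Hypothetical helper: it assumes patent_path looks like
    'patent/AU2016304408B2/en' (the assumed shape of the data-result value).
    """
    # urljoin tolerates a missing trailing slash on the domain;
    # stripping the leading slash keeps the path relative to it.
    base = urljoin(domain, patent_path.lstrip("/"))
    # Drop any leading '?' so exactly one is added back below.
    query = query.lstrip("?")
    return base + "?" + query if query else base


print(build_patent_url(
    "https://patents.google.com/",
    "patent/AU2016304408B2/en",
    "?assignee=pfizer&after=priority:20010602&type=PATENT",
))
```

Inside the loop, product_link = build_patent_url(url_domain, url_patent, url_query) would then replace the plain concatenation.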
