I'm using Selenium with Python to retrieve the links to all of the patents returned by a Google Patents search. This is the code I currently have:
import urllib.request
from selenium import webdriver
urlpage = 'https://patents.google.com/?assignee=pfizer&after=priority:20010602&type=PATENT'
print(urlpage)
driver = webdriver.Firefox()
driver.get(urlpage)
results = driver.find_elements_by_xpath("//*[@id='link']")
data = []
for result in results:
    product_name = result.text
    link = result.find_element_by_tag_name('a')
    product_link = link.get_attribute("href")
    data.append({"product" : product_name, "link" : product_link})
For example, I'm trying to retrieve
/patent/AU2016304408B2/en?assignee=Pfizer&after=priority:20110602&type=PATENT&num=100
from the first link on the page. However, I'm not sure if I'm using the right XPath to find all of the links.
I'm also getting an error in the step that appends the links to the list: "Message: Unable to locate element: a". When I print 'results', I can't tell what to make of the output, so I'm having trouble debugging this myself.
There are several challenges with this page. Not all of the #link elements are visible or are what you are looking for, among other challenges.
I found a locator that narrows the results down to just the visible elements containing the articles you are looking for. Getting the name of the article is pretty straightforward. Getting the URL of the A tag took some work: I basically built it myself from the domain name, your query parameters, and the patent number taken from a different element.
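from selenium import webdriver
# Note: the find_element(s)_by_* helpers below are the Selenium 3 API.
# In Selenium 4 use find_element(By.ID, ...) / find_elements(By.CSS_SELECTOR, ...)
# with "from selenium.webdriver.common.by import By".
url_domain = 'https://patents.google.com/'
url_query = '?assignee=pfizer&after=priority:20010602&type=PATENT'
print(url_domain + url_query)
driver = webdriver.Firefox()
driver.get(url_domain + url_query)
# Each visible search result is an <article> under #main
results = driver.find_elements_by_css_selector("#main article")
data = []
for result in results:
    # The title of the result is the text of its #link element
    product_name = result.find_element_by_id('link').text
    # The patent path comes from the data-result attribute of <state-modifier>
    url_patent = result.find_element_by_css_selector('state-modifier').get_attribute('data-result')
    # Build the link by hand: domain + patent path + original query
    product_link = url_domain + url_patent + url_query
    data.append({"product" : product_name, "link" : product_link})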
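The URL-building step above relies on the three pieces lining up exactly (trailing slash on the domain, no leading slash on the patent path, leading '?' on the query). As a minimal sketch, the concatenation could be wrapped in a small helper that tolerates those variations. The name build_patent_link is hypothetical, and the sample data-result value 'patent/AU2016304408B2/en' is an assumption based on the link shown in the question, not something confirmed from the page.

```python
def build_patent_link(url_domain, data_result, url_query):
    """Join the site root, a result path, and the search query into one URL.

    Hypothetical helper: it only normalises slashes and the leading '?';
    it does not validate or encode any of the pieces.
    """
    # Make sure exactly one '/' separates the domain from the path.
    base = url_domain.rstrip('/') + '/'
    path = data_result.lstrip('/')
    # Make sure a non-empty query starts with exactly one '?'.
    query = '?' + url_query.lstrip('?') if url_query else ''
    return base + path + query


if __name__ == '__main__':
    # Assumed example value for the data-result attribute.
    link = build_patent_link(
        'https://patents.google.com/',
        'patent/AU2016304408B2/en',
        '?assignee=pfizer&after=priority:20010602&type=PATENT',
    )
    print(link)
```

Inside the loop, product_link = build_patent_link(url_domain, url_patent, url_query) would then replace the plain string concatenation.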