Python Selenium - Get Google search HREF
I have two example href values from my Google search for site:linkedin.com/in/ AND "Software Developer" AND "London":
<a href="https://uk.linkedin.com/in/roxana-andreea-popescu" data-ved="2ahUKEwjou5D9xeztAhUDoVwKHQStC5EQFjAAegQIARAC" ping="/url?sa=t&source=web&rct=j&url=https://uk.linkedin.com/in/roxana-andreea-popescu&ved=2ahUKEwjou5D9xeztAhUDoVwKHQStC5EQFjAAegQIARAC"><br><h3 class="LC20lb DKV0Md"><span>Roxana Andreea Popescu - Software Developer - Gumtree ...</span></h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">uk.linkedin.com<span class="dyjrff qzEoUe"><span> › roxana-andreea-popescu</span></span></cite></div></a>
<a href="https://uk.linkedin.com/in/tunjijabitta" data-ved="2ahUKEwi-tsulxuztAhXViVwKHX0HAOMQFjABegQIBBAC" ping="/url?sa=t&source=web&rct=j&url=https://uk.linkedin.com/in/tunjijabitta&ved=2ahUKEwi-tsulxuztAhXViVwKHX0HAOMQFjABegQIBBAC"><br><h3 class="LC20lb DKV0Md"><span>Tunji Jabitta - London, Greater London, United Kingdom ...</span></h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">uk.linkedin.com<span class="dyjrff qzEoUe"><span> › tunjijabitta</span></span></cite></div></a>
I'm building a LinkedIn scraper, but I'm having trouble getting the href value of each result (they are all different) so that I can loop through them.
I tried
linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a')
links = [linkedin_url.get_attribute('href') for linkedin_url in linkedin_urls]
for linkedin_url in linkedin_urls:
    driver.get(links)
    sleep(5)
    sel = Selector(text=driver.page_source)
But I get the error: invalid argument: 'url' must be a string
Another option I tried was
linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[@href]')
for linkedin_url in linkedin_urls:
    url = linkedin_url.get_attribute("href")
    driver.get(url)
    sleep(5)
    sel = Selector(text=driver.page_source)
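I managed to open the first link, but when trying to get the next one I get an error at url = linkedin_url.get_attribute("href")
Any help would be greatly appreciated; I've been stuck on this for quite a while.
Your driver is opening the link as a new page, but in doing so it appears to be discarding the previous page. You may want to consider opening the link in a new tab or window, switching to that tab/window, and, once you're done, going back to the previous page and continuing.
Suggested implementation:
1. Create a function that opens a link (or element) in a new tab and switches to that tab: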
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Define a function which opens your element in a new tab:
def open_in_new_tab(driver, element):
    """This is better than opening in a new link since it mimics 'human' behavior"""
    # What is the handle you're starting with
    base_handle = driver.current_window_handle
    # Keys.COMMAND is the macOS modifier; on Windows/Linux use Keys.CONTROL instead
    ActionChains(driver) \
        .move_to_element(element) \
        .key_down(Keys.COMMAND) \
        .click() \
        .key_up(Keys.COMMAND) \
        .perform()
    # There should be 2 tabs right now...
    if len(driver.window_handles) != 2:
        raise ValueError(f'Length of {driver.window_handles} != 2... {len(driver.window_handles)=};')
    # get the new handle
    for x in driver.window_handles:
        if x != base_handle:
            new_handle = x
    # Now switch to the new window
    driver.switch_to.window(new_handle)
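The check on `len(driver.window_handles)` runs immediately after the click, so it may fire before the browser has registered the new tab. A minimal sketch of a polling alternative follows, assuming only that the driver exposes `window_handles` as in the function above. The helper name `wait_for_new_handle` and its `timeout` and `poll` parameters are hypothetical, not part of Selenium.

```python
import time


def wait_for_new_handle(driver, known_handles, timeout=10.0, poll=0.1):
    """Poll driver.window_handles until a handle not in known_handles appears.

    Hypothetical helper: returns the first new handle, or raises
    TimeoutError if none shows up within `timeout` seconds.
    """
    known = set(known_handles)
    deadline = time.monotonic() + timeout
    while True:
        # Any handle we have not seen before belongs to the newly opened tab.
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No new window handle appeared within {timeout} s")
        time.sleep(poll)
```

Inside `open_in_new_tab`, one could record `driver.window_handles` before the click and then call `driver.switch_to.window(wait_for_new_handle(driver, handles_before))` in place of the length check and the handle loop.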
2. Run it and switch back to the main tab:
import time

# This returns a list of elements
# (newer Selenium releases removed find_elements_by_xpath; there, use
# driver.find_elements(By.XPATH, ...) with `from selenium.webdriver.common.by import By`)
linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[@href]')
# A bit redundant, but it's web scraping, so redundancy won't hurt you.
BASE_HANDLE = driver.current_window_handle  # All caps so you can follow it more easily...
for element in linkedin_urls:
    # open the link in a new tab and switch to it:
    open_in_new_tab(driver, element)
    # give the page a moment to load:
    time.sleep(0.5)
    # Do something on this page
    print(driver.current_url)
    # Once you're done, get back to the original tab
    # Go through all tabs (there should only be 2) and close each one unless
    # it's BASE_HANDLE
    for x in driver.window_handles:
        if x != BASE_HANDLE:
            driver.switch_to.window(x)
            driver.close()
    # Now switch back to the original window
    assert BASE_HANDLE in driver.window_handles  # a quick sanity check
    driver.switch_to.window(BASE_HANDLE)  # this takes you back
# Finally, once your for-loop is complete, you can choose to continue with the driver or close + quit (like a human would)
driver.close()
driver.quit()
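If mimicking tab behavior is not required, a simpler alternative is to read every href into a list of plain strings while the results page is still loaded, and only then navigate. This sidesteps both problems in the question: `driver.get()` receives one string at a time rather than the whole list, and no element is touched after the results page has been discarded. The following is a sketch under those assumptions; the function names `collect_hrefs` and `visit_all` and the `pause` parameter are hypothetical.

```python
import time


def collect_hrefs(elements):
    """Read href strings while the elements are still attached to the page."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        # Skip anchors without an href and avoid visiting duplicates.
        if href and href not in hrefs:
            hrefs.append(href)
    return hrefs


def visit_all(driver, hrefs, pause=5.0):
    """Visit each URL in turn and return a dict mapping URL to page source."""
    pages = {}
    for href in hrefs:
        driver.get(href)   # one string per call, never the whole list
        time.sleep(pause)  # crude wait, as in the question
        pages[href] = driver.page_source
    return pages
```

With the driver from the question, usage might look like `pages = visit_all(driver, collect_hrefs(linkedin_urls))`, where `linkedin_urls` is the list returned by the XPath lookup above; each page source can then be passed to `Selector(text=...)`.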