Python Selenium - Get Google Search HREF

Question

I have two examples of href values from my Google search for site:linkedin.com/in/ AND "Software Developer" AND "London":

<a href="https://uk.linkedin.com/in/roxana-andreea-popescu" data-ved="2ahUKEwjou5D9xeztAhUDoVwKHQStC5EQFjAAegQIARAC" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://uk.linkedin.com/in/roxana-andreea-popescu&amp;ved=2ahUKEwjou5D9xeztAhUDoVwKHQStC5EQFjAAegQIARAC"><br><h3 class="LC20lb DKV0Md"><span>Roxana Andreea Popescu - Software Developer - Gumtree ...</span></h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">uk.linkedin.com<span class="dyjrff qzEoUe"><span> › roxana-andreea-popescu</span></span></cite></div></a>

<a href="https://uk.linkedin.com/in/tunjijabitta" data-ved="2ahUKEwi-tsulxuztAhXViVwKHX0HAOMQFjABegQIBBAC" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://uk.linkedin.com/in/tunjijabitta&amp;ved=2ahUKEwi-tsulxuztAhXViVwKHX0HAOMQFjABegQIBBAC"><br><h3 class="LC20lb DKV0Md"><span>Tunji Jabitta - London, Greater London, United Kingdom ...</span></h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">uk.linkedin.com<span class="dyjrff qzEoUe"><span> › tunjijabitta</span></span></cite></div></a>

I am creating a LinkedIn scraper and I am having a problem getting the href value (which differs for each result) so that I can loop through the results.

I tried:

linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a')
links = [linkedin_url.get_attribute('href') for linkedin_url in linkedin_urls]
for linkedin_url in linkedin_urls:
    driver.get(links)
    sleep(5)
    sel = Selector(text=driver.page_source)

But I get the error: invalid argument: 'url' must be a string

Another alternative I have tried was:

linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[@href]')
for linkedin_url in linkedin_urls:
    url = linkedin_url.get_attribute("href")

    driver.get(url)
    sleep(5)
    sel = Selector(text=driver.page_source)

I managed to get the first link opened, but it threw an error at url = linkedin_url.get_attribute("href") when trying to get the next link.

Any help would be greatly appreciated; I have been stuck on this for quite a while.

Answer 1

Your driver opens each link in the same tab, so it appears to discard the previous page. Once the results page is gone, the elements you collected from it can no longer be used, which would explain why the second get_attribute("href") call fails. Consider opening each link in a new tab or window, switching to that tab/window, and, once you are done there, going back to the original page and continuing.

Suggested execution:

1. Create a function to open a link (or element) in a new tab and switch to that tab:

import sys

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Cmd+click opens a new tab on macOS; Ctrl+click does the same on Windows and Linux.
# This assumes the browser runs on the same machine as the script.
NEW_TAB_KEY = Keys.COMMAND if sys.platform == "darwin" else Keys.CONTROL

# Define a function which opens your element in a new tab:
def open_in_new_tab(driver, element):
    """Open element in a new tab with a modifier-click and switch to it.

    This mimics 'human' behavior better than loading the link directly.
    """
    # What is the handle you're starting with
    base_handle = driver.current_window_handle

    ActionChains(driver) \
        .move_to_element(element) \
        .key_down(NEW_TAB_KEY) \
        .click() \
        .key_up(NEW_TAB_KEY) \
        .perform()

    # There should be 2 tabs right now...
    if len(driver.window_handles) != 2:
        raise ValueError(f'Length of {driver.window_handles} != 2... {len(driver.window_handles)=};')

    # get the new handle
    for x in driver.window_handles:
        if x != base_handle:
            new_handle = x
    # Now switch to the new window
    driver.switch_to.window(new_handle)
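
One caveat with the length check above: the new tab may not have registered yet when driver.window_handles is read. A possible variation, sketched here as an assumption rather than as part of the original function, polls until a handle that was not there before shows up. wait_for_new_handle and its timeout and poll parameters are hypothetical names; the only driver attribute it relies on is window_handles.

```python
import time


def wait_for_new_handle(driver, known_handles, timeout=5.0, poll=0.1):
    """Poll driver.window_handles until a handle not in known_handles appears.

    Returns the first new handle, or raises TimeoutError if none shows up
    before the timeout (in seconds) expires.
    """
    known = set(known_handles)
    deadline = time.monotonic() + timeout
    while True:
        # Anything not seen before the click is taken to be the new tab.
        new_handles = [h for h in driver.window_handles if h not in known]
        if new_handles:
            return new_handles[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new tab appeared within {timeout} seconds")
        time.sleep(poll)
```

Inside open_in_new_tab, you could record driver.window_handles before the click and pass that list as known_handles afterwards, then switch to the returned handle.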

2. Execute + Switch back to the main tab:

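import time

from selenium.webdriver.common.by import By

# This returns a list of elements
# (find_elements_by_xpath is deprecated in Selenium 4; find_elements(By.XPATH, ...) works in Selenium 3 and 4)
linkedin_urls = driver.find_elements(By.XPATH, '//div[@class="yuRUbf"]//a[@href]')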

# A bit redundant, but it's web scraping, so redundancy won't hurt you.
BASE_HANDLE = driver.current_window_handle # All caps so you can follow it more easily...

for element in linkedin_urls:
    # switch to the new tab:
    open_in_new_tab(driver, element)
    
    # give the page a moment to load:
    time.sleep(0.5)
        
    # Do something on this page
    print(driver.current_url)

    # Once you're done, get back to the original tab
    # Go through all tabs (there should only be 2) and close each one unless
    # it's BASE_HANDLE

    for x in driver.window_handles:
        if x != BASE_HANDLE:
            driver.switch_to.window(x)
            driver.close()
    # Now switch back to the original tab
    assert BASE_HANDLE in driver.window_handles # a quick sanity check
    driver.switch_to.window(BASE_HANDLE) # this takes you back

# Finally, once your for-loop is complete, you can choose to continue with the driver or close + quit (like a human would)
driver.close()
driver.quit()
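
A simpler alternative, not part of the suggested execution above, is to read every href into a list of plain strings while the results page is still loaded, and only then navigate. This also addresses the 'url' must be a string error from the question, which comes from passing the whole links list to driver.get instead of one URL at a time. The sketch below is a minimal illustration under that assumption; collect_hrefs and visit_each are hypothetical helper names, and the driver and elements are whatever Selenium returns in your script.

```python
from time import sleep


def collect_hrefs(elements):
    """Read each element's href while the results page is still loaded."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # skip anchors that have no href
            hrefs.append(href)
    return hrefs


def visit_each(driver, urls, pause=5):
    """Load every URL in turn and return the page source of each one."""
    pages = []
    for url in urls:
        driver.get(url)  # url is a single string here, not the whole list
        sleep(pause)     # fixed pause, as in the question
        pages.append(driver.page_source)
    return pages
```

With the linkedin_urls list from the question, this would be called as links = collect_hrefs(linkedin_urls) followed by pages = visit_each(driver, links). Because the strings are copied out before the first driver.get, leaving the results page no longer matters.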

Answer 1 posted 2020-12-26 21:55:07
