Why can't the requested URL be loaded when scraping LinkedIn? (Python)

I am trying to scrape LinkedIn. The script had been working for 3 months, but yesterday it crashed.

I use Selenium WebDriver with Firefox and a fake user agent.

The URL is https://www.linkedin.com/company/my_company/

def init_driver():
    """Initiates selenium webdriver.
    :return: Firefox browser instance
    """
    try:
        #  use random UserAgent to avoid captcha
        fp = webdriver.FirefoxProfile()
        fp.set_preference("general.useragent.override", UserAgent().random)
        fp.update_preferences()
        # initiate driver
        options = FirefoxOptions()
        #options.add_argument("--headless")
        return webdriver.Firefox(firefox_options=options)
    except Exception as e:
        logging.error('Exception occurred initiating webdriver', exc_info=True)

And then I just open a page with driver.get(url).

At this point the browser opens the page, but the page cannot load.

The same thing happens without the fake user agent and when using Chrome.

Has anyone encountered something like this? When I open the link myself, everything is OK.

https://www.linkedin.com/authwall?trk=gf&trkInfo=AQFvPeNP8NQIxwAAAXLqc-uI5rnQe1ZIysPcZOgjZCzbrBHZj7q6gd68fPG9NzbX00Rlre_yC0tITChjMDEXSNnD8tZRaMXqcRG-z_3QUMlCvQPR4uVGBQYoSOl3ycoO2E6Jl9w=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2my_company%2F

The function opens other URLs without problems.

Here is how you should modify your code. With these changes, your code ran correctly for me.

from selenium import webdriver
from fake_useragent import UserAgent
import logging


def init_driver():
    """Initiates selenium webdriver.
    :return: Firefox browser instance
    """
    # path to the geckodriver executable
    path = r"your firefox driver path"

    try:
        # use random UserAgent to avoid captcha
        fp = webdriver.FirefoxProfile()
        fp.set_preference("general.useragent.override", UserAgent().random)
        fp.update_preferences()
        # initiate driver
        options = webdriver.FirefoxOptions()
        # options.add_argument("--headless")
        # pass the profile too, otherwise the user-agent override is never applied
        # (these are Selenium 3 keyword arguments, as in the original code)
        return webdriver.Firefox(firefox_profile=fp, firefox_options=options,
                                 executable_path=path)
    except Exception:
        logging.error('Exception occurred initiating webdriver', exc_info=True)
        raise  # re-raise so the caller does not continue with driver = None


url = "your url"

driver = init_driver()
driver.get(url)
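
The URL quoted in the question suggests that the automated session was redirected to LinkedIn's login wall: its path is /authwall and the originally requested page is carried in the sessionRedirect parameter. As a minimal sketch, assuming that redirect pattern holds, the helpers below inspect a URL string (for example driver.current_url) and report whether it is such a redirect. The names is_authwall and requested_page are hypothetical, and the example URL is a shortened version of the one in the question.

```python
from urllib.parse import urlparse, parse_qs


def is_authwall(url):
    """Return True if the URL path points at LinkedIn's /authwall page.

    Assumption: the login wall always uses the /authwall path, as in the
    URL quoted in the question.
    """
    return urlparse(url).path.startswith("/authwall")


def requested_page(url):
    """Return the page that was originally requested, or None.

    Assumption: the original page is stored percent-encoded in the
    sessionRedirect query parameter.
    """
    if not is_authwall(url):
        return None
    # parse_qs percent-decodes the parameter values
    targets = parse_qs(urlparse(url).query).get("sessionRedirect")
    return targets[0] if targets else None


# Shortened, hypothetical example modelled on the URL from the question
redirected = (
    "https://www.linkedin.com/authwall?trk=gf&originalReferer="
    "&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fmy_company%2F"
)

print(is_authwall(redirected))
print(requested_page(redirected))
```

With a live driver, this could be called as is_authwall(driver.current_url) right after driver.get(url). A True result would mean that LinkedIn asked the session to log in instead of serving the company page, which would be consistent with the same failure appearing in both Firefox and Chrome.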
