Use Scrapy Spider to log in to website with JavaScript form

I'm trying to make a spider that can log into a particular site and scrape all the URLs of its subpages.

The problem is that my spider can't find the form on the login page - apparently it's injected by JavaScript.

So I tried to manually send a POST request (FormRequest), but that failed because I couldn't get the correct auth token (it seems to be a hash of the email address, the password and something else), resulting in a 403 response. (My code for the manual login request)
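
(For context, a minimal sketch of the kind of FormRequest login attempt meant above - the field names email, password and token are hypothetical placeholders, not the site's real ones. It fails because the token value is computed by JavaScript in the browser, so the spider can only guess it:)

from scrapy import Spider
from scrapy.http import FormRequest

class ManualLoginSpider(Spider):
    name = 'manual_login'
    start_urls = ['https://authentication.test.org/login/']

    def parse(self, response):
        #The token field is the problem: its value is generated by JavaScript
        #in the browser, so whatever is guessed here gets rejected with a 403.
        yield FormRequest(
            url='https://authentication.test.org/login/',
            formdata={
                'email': 'email@dummy.com',
                'password': 'test1234',
                'token': '???',  #unknown JS-generated hash
            },
            meta={'handle_httpstatus_list': [403]},  #let the 403 reach the callback
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('Login response status: %s', response.status)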

So then I tried using Selenium in my script to log in and then transfer the cookies to Scrapy - but it looks like Selenium can't see the form either! (My code for the Selenium Scrapy script)

Here is the console output:

2020-05-04 15:48:15 [scrapy.core.engine] INFO: Spider opened
2020-05-04 15:48:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-04 15:48:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-04 15:48:16 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:58616/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "platformName": "any", "goog:chromeOptions": {"extensions": [], "args": []}}}, "desiredCapabilities": {"browserName": "chrome", "version": "", "platform": "ANY", "goog:chromeOptions": {"extensions": [], "args": []}}}
2020-05-04 15:48:16 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:58616

DevTools listening on ws://127.0.0.1:58623/devtools/browser/c0e9971e-c9e8-485d-9447-b0e37e9398a6
[23160:19992:0504/154816.724:ERROR:browser_switcher_service.cc(238)] XXX Init()
2020-05-04 15:48:18 [urllib3.connectionpool] DEBUG: http://127.0.0.1:58616 "POST /session HTTP/1.1" 200 720
2020-05-04 15:48:18 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-05-04 15:48:18 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:58616/session/736d1883510320360aefb2afe29c3092/url {"url": "https://authentication.asfinag.at/login/"}
2020-05-04 15:48:18 [urllib3.connectionpool] DEBUG: http://127.0.0.1:58616 "POST /session/736d1883510320360aefb2afe29c3092/url HTTP/1.1" 200 14
2020-05-04 15:48:18 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-05-04 15:48:18 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:58616/session/736d1883510320360aefb2afe29c3092/element {"using": "xpath", "value": "//input[1]"}
2020-05-04 15:48:18 [urllib3.connectionpool] DEBUG: http://127.0.0.1:58616 "POST /session/736d1883510320360aefb2afe29c3092/element HTTP/1.1" 404 1038
2020-05-04 15:48:18 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-05-04 15:48:18 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Users\...\Miniconda3\envs\crawler-test\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\...\crawlerTest\crawlerTest\spiders\hybrid_spider.py", line 24, in init_request
    driver.find_element(By.XPATH, '//input[1]').send_keys('email@dummy.com')
  File "C:\Users\...\Miniconda3\envs\crawler-test\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 976, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "C:\Users\...\Miniconda3\envs\crawler-test\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\...\Miniconda3\envs\crawler-test\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//input[1]"}
  (Session info: chrome=81.0.4044.129)

2020-05-04 15:48:18 [scrapy.core.engine] INFO: Closing spider (finished)

I'm open to any suggestions on how to log in with my spider so I can start scraping.

OK, so the problem with my Selenium script was that the script ended before the page had loaded. I added a driver.implicitly_wait(15) before trying to find the input elements and logging in, and that solved it.
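
(As an alternative to the blanket driver.implicitly_wait(15), an explicit wait blocks only until the JavaScript-injected field actually appears. A minimal sketch, assuming the same tbUsername id and the driver/username variables from the spider below:)

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#Wait up to 15 seconds for the JavaScript-injected username field to appear
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.ID, 'tbUsername'))).send_keys(username)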

Here is the complete code for my working spider:

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.http import Request
from scrapy.selector import Selector
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

#Provide domain, so the spider does not follow external links
DOMAIN = 'test.org'
URL = 'http://%s' % DOMAIN

class HybridSpider(InitSpider):
    name = 'hybrid_spider'
    allowed_domains = [DOMAIN]
    start_urls = [URL]
    url_array = []

    def init_request(self):
        #Provide site information
        username = 'email@dummy.com'
        password = 'test1234'
        login_url = 'https://authentication.test.org/login/'
        start_crawling_url = 'https://www.test.org'

        #Set path to your chromedriver.exe here.
        driver = webdriver.Chrome('C:/Users/somepath/chromedriver.exe')

        driver.get(login_url)
        driver.implicitly_wait(15)
        #If the input field has an id, use it to find the correct field
        driver.find_element_by_id('tbUsername').send_keys(username)
        driver.find_element_by_id('tbPassword').send_keys(password, Keys.ENTER)
        #If it doesn't, get all input fields from the page and tell Selenium to use the n-th one
        #driver.find_element(By.XPATH, '//input[1]').send_keys(username)
        #driver.find_element(By.XPATH, '//input[last()]').send_keys(password, Keys.ENTER)

        driver.implicitly_wait(15)
        yield Request(start_crawling_url, cookies=driver.get_cookies(), callback=self.parse, headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'})


    def parse(self, response):
        hxs = Selector(response)
        filename = 'urls.txt'

        for url in hxs.xpath('//a/@href').extract():
            if (url.startswith(('http://', 'https://', '/'))
                    and not url.endswith(('.pdf', '.zip', '.xls', '.jpg'))):
                if not url.startswith(('http://', 'https://')):
                    url = URL + url

                cleanedUrl = url.lower().replace("http://",'').replace("https://",'').replace("www.",'')
                if cleanedUrl.endswith("/"):
                    cleanedUrl = cleanedUrl[:-1]

                #Only process URLs that haven't been seen yet (dedup on the cleaned URL)
                if cleanedUrl not in self.url_array:
                    self.url_array.append(cleanedUrl)
                    print(cleanedUrl)
                    with open(filename, 'a+') as f:
                        f.write(cleanedUrl+'\n')

                    #The spider doesn't follow redirects with status 301 and 302. If you want it to follow redirects, delete the meta object
                    yield Request(url, callback=self.parse, headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}, meta={'dont_redirect': True,"handle_httpstatus_list": [301, 302]})

Just in case another Scrapy newbie stumbles upon this and tries to make this spider work, here are the setup instructions:

Save the above code to a file named hybrid_spider.py.

Install Miniconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Check what version of the Chrome browser you have installed and download the matching chromedriver.exe from https://chromedriver.chromium.org/downloads. Then update the chromedriver path in the hybrid_spider.py script. (Ideally, place chromedriver.exe in the folder of the spider.)

Open the Anaconda Prompt (Miniconda3), and there:

  • make new environment: conda create --name crawler-test
  • activate environment: conda activate crawler-test
  • install Scrapy: conda install -c conda-forge scrapy
  • install Selenium: conda install -c conda-forge selenium

Using the file explorer, navigate to the crawlerTest directory, subfolder spiders (mine was at C:\Users\username\crawlerTest\crawlerTest\spiders), and place hybrid_spider.py there.
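
(These instructions assume a Scrapy project named crawlerTest already exists. If it doesn't, the usual way to generate that folder layout is scrapy startproject, run inside the activated crawler-test environment - assuming the same paths as above:)

cd C:\Users\username
python -m scrapy startproject crawlerTest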

For each website you want to crawl, change the login info directly in the hybrid_spider.py script.

To start scraping URLs, in the Anaconda Prompt:

  • activate environment: conda activate crawler-test
  • navigate to the crawlerTest directory (mine was at C:\Users\username\crawlerTest)
  • start spider: python -m scrapy crawl hybrid_spider

This will create a file urls.txt in the crawlerTest directory. (The file is opened in append mode, so URLs from each crawl are appended rather than overwritten.)
