How can I manually authenticate before Scrapy runs?

I want to scrape a web page that requires a ridiculous number of captcha challenges before I can log in (e.g. more than 20 challenges in sequence).

How can I log in by solving the captchas myself, by hand (i.e. not with Selenium etc.), and then have the web scraping run? I have searched the Scrapy documentation, tutorials and the web for code that does this and found nothing.

Obligatory code that doesn't do the thing I am asking about:

import scrapy

class BadSpider(scrapy.Spider):
    name = "bad"

    def start_requests(self):
        [...]

    def parse(self, response):
        if (response.url.endswith('/login')):
            print('!!!!! I have no idea what to do here!!!!')
        else:
            [...]

I want it to start after I have manually authenticated. Instead, it starts before I have logged in, so I cannot go any further.

  1. Authenticate manually in your browser.
  2. Open your browser's DevTools.
  3. Navigate to the Network tab.
  4. Reload the page you want to scrape.
  5. In the Network tab, right-click the first request and look for the Copy as cURL (bash) option.
  6. Go to https://curl.trillworks.com/ and paste the copied command.
  7. Copy the headers and cookies from the converted output and use them in your spider's requests, and you are done.

PS: I would suggest doing this in Mozilla Firefox, because the command copied from Chrome's DevTools sometimes produces incorrect results in https://curl.trillworks.com/
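
If the converted output looks wrong, one possible fallback (not part of the steps above, just a sketch) is to copy the value of the `Cookie` request header straight from the Network tab and split it yourself. The helper name `parse_cookie_header` and the sample cookie names below are made up for illustration; the resulting dict can be passed as the `cookies` argument of `scrapy.Request`.

```python
def parse_cookie_header(header: str) -> dict[str, str]:
    """Turn a raw Cookie header value ("a=1; b=2") into a name-to-value dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only, since cookie values may contain "=".
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical header value, as it might be copied from the browser.
raw = "sessionid=abc123; csrftoken=x=y"
cookies = parse_cookie_header(raw)
```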
