
NTLM authentication with Scrapy for web scraping

I am trying to scrape data from a website that requires authentication. I have been able to log in successfully using requests and HttpNtlmAuth with the following:

s = requests.session()     
url = "https://website.com/things"                                                      
response = s.get(url, auth=HttpNtlmAuth('DOMAIN\\USERNAME','PASSWORD'))

I would now like to explore the capabilities of Scrapy, but I have not been able to authenticate successfully.

I came across the following middleware, which looks like it should work, but I don't think I am implementing it correctly:

https://github.com/reimund/ntlm-middleware/blob/master/ntlmauth.py

In my settings.py I have:

SPIDER_MIDDLEWARES = { 'test.ntlmauth.NtlmAuthMiddleware': 400, }

and in my spider class I have:

http_user = 'DOMAIN\\USER'
http_pass = 'PASS'

I have not been able to get this to work.

If anyone who has successfully scraped a site with NTLM authentication could point me in the right direction, I would appreciate it.

I was able to figure out what was going on.

1: It is considered a "DOWNLOADER_MIDDLEWARE", not a "SPIDER_MIDDLEWARE":

DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }

2: The middleware I was trying to use needed to be modified significantly. This is what works for me:

from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        # Credentials are read from the spider (http_user / http_pass attributes)
        url = request.url
        pwd = getattr(spider, 'http_pass', '')
        usr = getattr(spider, 'http_user', '')
        # Fetch the page with requests + NTLM instead of Scrapy's own downloader,
        # then hand the result back to Scrapy wrapped in a Response
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        return Response(url, response.status_code, {}, response.content)

Inside the spider, all you need to do is set these variables:

http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
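
For reference, here is a minimal sketch of a spider that sets those attributes; the spider name and start URL are placeholders based on the question's example, and the parsing logic is up to you:

import scrapy

class ThingsSpider(scrapy.Spider):
    # hypothetical name and start URL for illustration
    name = "things"
    start_urls = ["https://website.com/things"]

    # picked up by NTLM_Middleware via getattr(spider, 'http_user') / getattr(spider, 'http_pass')
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASS'

    def parse(self, response):
        # the body here is whatever requests fetched through NTLM
        self.logger.info("Fetched %s (%d bytes)", response.url, len(response.body))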

Thanks to @SpaceDog's comment above: I ran into a similar problem trying to crawl an intranet site with NTLM authentication. The crawler only ever saw the first page, because the LinkExtractor in the CrawlSpider never kicked in (it only follows links from an HtmlResponse, and the middleware above returns a plain Response).

Here is my working solution, using Scrapy 1.0.5.

NTLM_Middleware.py

from scrapy.http import HtmlResponse
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        # Credentials are read from the spider (http_usr / http_pass attributes)
        url = request.url
        usr = getattr(spider, 'http_usr', '')
        pwd = getattr(spider, 'http_pass', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # Returning an HtmlResponse (rather than a plain Response) lets the
        # CrawlSpider's LinkExtractor parse the page and follow its links.
        # Note: headers.iteritems() is Python 2; use .items() on Python 3.
        return HtmlResponse(url, response.status_code, response.headers.iteritems(), response.content)

settings.py

import logging

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'scrapy intranet'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS=16


# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'intranet.NTLM_Middleware.NTLM_Middleware': 200,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware':None
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 100,
}

ELASTICSEARCH_SERVER='localhost'
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_USERNAME=''
ELASTICSEARCH_PASSWORD=''
ELASTICSEARCH_INDEX='intranet'
ELASTICSEARCH_TYPE='pages_intranet'
ELASTICSEARCH_UNIQ_KEY='url'
ELASTICSEARCH_LOG_LEVEL=logging.DEBUG
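
Note that the ElasticSearchPipeline and the ELASTICSEARCH_* settings come from the third-party scrapy-elasticsearch package, which has to be installed separately (presumably via pip, under the name that project publishes); they are unrelated to the NTLM part and can be left out of a minimal setup.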

spiders/intranetspider.py

# -*- coding: utf-8 -*-
import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from bs4 import BeautifulSoup

class PageItem(scrapy.Item):
    body=scrapy.Field()
    title=scrapy.Field()
    url=scrapy.Field()

class IntranetspiderSpider(CrawlSpider):
    http_usr='DOMAIN\\user'
    http_pass='pass'
    name = "intranetspider"
    protocol='https://'
    allowed_domains = ['intranet.mydomain.ca']
    start_urls = ['https://intranet.mydomain.ca/']
    rules = (Rule(LinkExtractor(),callback="parse_items",follow=True),)

    def parse_items(self, response):
        self.logger.info('Crawling page %s', response.url)
        item = PageItem()

        soup = BeautifulSoup(response.body)

        #remove script tags and javascript from content
        [x.extract() for x in soup.findAll('script')]

        item['body']=soup.get_text(" ", strip=True)
        item['url']=response.url

        return item
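
With the middleware, settings and spider in place, the crawl is run from the project directory as usual with scrapy crawl intranetspider.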
