简体   繁体   中英

scrapy not sending Cookies with POST request

I'm attempting to use scrapy to submit a POST request, but it's not sending the Cookies in the header.

Setup

Running under OSX. Created a virtualenv and ran pip install Scrapy . Then I created a default spider:

(hotlanesbot)tollspider $ scrapy startproject vai66tolls
(hotlanesbot)tollspider $ cd vai66tolls/
(hotlanesbot)vai66tolls $ scrapy genspider vai66tolls-spider vai66tolls.com

I then enabled cookies debugging in settings.py :

COOKIES_DEBUG = True

Code

The code for the spider is pretty basic: parse the site then POST the form and process the response in parse_eb . Content of vai66tolls_spider.py :

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.cookies import CookieJar

class Vai66tollsSpiderSpider(scrapy.Spider):
    name = 'vai66tolls-spider'
    allowed_domains = ['vai66tolls.com']
    start_urls = ['http://vai66tolls.com/']

    def parse(self, response):
        filename = "/tmp/body.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

        self.log('Initial Response headers: (%s)' % response.headers)

        # look for "cookie" things in response headers
        poss_cookies = response.headers.getlist('Set-Cookie')
        self.log('Set-Cookie?: (%s)' % poss_cookies)

        poss_cookies = response.headers.getlist('Cookie')
        self.log('Cookie?: (%s)' % poss_cookies)

        poss_cookies = response.headers.getlist('cookie')
        self.log('cookie?: (%s)' % poss_cookies)

        # Parse Eastbound
        r = scrapy.FormRequest.from_response(
            response,
            callback=self.parse_eb,
            )

        yield r

    def parse_eb(self, response):
        filename = "/tmp/eb.txt"
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        self.log('Request headers: %s' % response.request.headers)
        self.log('Request cookies: %s' % response.request.cookies)

You can view it on github here .

Output

I'm running the scraper with:

(hotlanesbot)vai66tolls $ scrapy crawl vai66tolls-spider

In the log output, I see "Received cookies" DEBUG statement, but not the "Sending cookies to" message I'd expect from the documentation / the CookiesMiddleware .

Here's a larger excerpt from the output:

2018-01-10 08:50:35 [scrapy.core.engine] INFO: Spider opened
2018-01-10 08:50:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-10 08:50:35 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-10 08:50:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://vai66tolls.com/robots.txt> from <GET http://vai66tolls.com/robots.txt>
2018-01-10 08:50:35 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://vai66tolls.com/robots.txt> (referer: None)
2018-01-10 08:50:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://vai66tolls.com/> from <GET http://vai66tolls.com/>
2018-01-10 08:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vai66tolls.com/> (referer: None)
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Saved file /tmp/body.html
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Initial Response headers: ({'X-Powered-By': ['ASP.NET'], 'X-Aspnet-Version': ['4.0.30319'], 'Server': ['Microsoft-IIS/10.0'], 'Cache-Control': ['private'], 'Date': ['Wed, 10 Jan 2018 13:50:35 GMT'], 'Content-Type': ['text/html; charset=utf-8']})
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Set-Cookie?: ([])
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Cookie?: ([])
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: cookie?: ([])
2018-01-10 08:50:35 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://vai66tolls.com/>
Set-Cookie: ASP.NET_SessionId=im3zxr01stwmr02z0cisggbl; path=/; HttpOnly

2018-01-10 08:50:35 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://vai66tolls.com/> (referer: https://vai66tolls.com/)
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Saved file /tmp/eb.txt
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Request headers: {'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Scrapy/1.5.0 (+https://scrapy.org)'], 'Referer': ['https://vai66tolls.com/'], 'Content-Type': ['application/x-www-form-urlencoded']}
2018-01-10 08:50:35 [vai66tolls-spider] DEBUG: Request cookies: {}
2018-01-10 08:50:35 [scrapy.core.engine] INFO: Closing spider (finished)

(not shown is a line indicating scrapy.downloadermiddlewares.cookies.CookiesMiddleware is included in downloader middlewares).

For comparison, if I monitor the initial request via Chrome's debugger tools, I see the following Response Headers:

cache-control:private
content-length:7289
content-type:text/plain; charset=utf-8
date:Tue, 09 Jan 2018 04:38:57 GMT
server:Microsoft-IIS/10.0
status:200
x-aspnet-version:4.0.30319
x-powered-by:ASP.NET

And for the subsequent form POST the debugger tool reports these Request Headers:

:authority:vai66tolls.com
:method:POST
:path:/
:scheme:https
accept:*/*
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.9
cache-control:no-cache
content-length:4480
content-type:application/x-www-form-urlencoded; charset=UTF-8
cookie:ASP.NET_SessionId=up5ygvcjzjalnw2z1r1e0qeg
origin:https://vai66tolls.com
pragma:no-cache
referer:https://vai66tolls.com/
user-agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
x-microsoftajax:Delta=true
x-requested-with:XMLHttpRequest

Also with Chrome, I can generate a curl request the works correctly. Using the curl request I confirmed that removing the Cookies from the header is enough to prevent the correct response from returning. Eg, I recognize that there may be other required form data to be sent, but if I don't have the Cookies it's definitely going to fail.

Questions

  1. Why isn't scrapy including the Cookie in the header of the request?
  2. Is there some way I can manually get the cookie that scrapy is pulling so that I can add it to the FormRequest.from_response() ?

Check that you have also COOKIES_ENABLED set to True in the settings.

As for the second question. You should be able to extract cookies form the headers of Response object with

cookies = response.headers.getlist('Set-Cookie')

You can now manually insert them into the FormRequest passing them as arguments to from_response method. I think it should be possible to either use cookies parameter of Request object, or directly using headers parameter ( headers={'Cookie': xxx} ).

I solve it myself using the answer from here . It's best to handle cookies using cookies attribute instead of headers attribute. Somehow, headers attribute tends to handle cookies badly.

request_with_cookies = Request(url="http://...",cookies={'country': 'UY'})

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM