
Can't fetch json content from a stubborn webpage using scrapy

I'm trying to write a script using scrapy to grab JSON content from this webpage. I've set the headers accordingly, but when I run it I always get a JSONDecodeError. The site sometimes throws a captcha, but not always. However, the script below has never succeeded, even when I used a VPN. How can I fix it?

This is how I've tried:

import scrapy
import urllib.parse

class ImmobilienScoutSpider(scrapy.Spider):
    name = "immobilienscout"
    start_url = "https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen"
    
    headers = {
        'accept': 'application/json; charset=utf-8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'x-requested-with': 'XMLHttpRequest',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    }

    params = {
        'price': '1000.0-',
        'constructionyear': '-2000',
        'pagenumber': '1'
    }

    def start_requests(self):
        req_url = f'{self.start_url}?{urllib.parse.urlencode(self.params)}'
        yield scrapy.Request(
            url=req_url,
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self, response):
        yield {"response": response.json()}

This is what the output should look like (truncated):

{"searchResponseModel":{"additional":{"lastSearchApiUrl":"/region?realestatetype=apartmentbuy&price=1000.0-&constructionyear=-2000&pagesize=20&geocodes=1276010&pagenumber=1","title":"Eigentumswohnung in Nordrhein-Westfalen - ImmoScout24","sortingOptions":[{"description":"Standardsortierung","code":0},{"description":"Kaufpreis (höchste zuerst)","code":3},{"description":"Kaufpreis (niedrigste zuerst)","code":4},{"description":"Zimmeranzahl (höchste zuerst)","code":5},{"description":"Zimmeranzahl (niedrigste zuerst)","code":6},{"description":"Wohnfläche (größte zuerst)","code":7},{"description":"Wohnfläche (kleinste zuerst)","code":8},{"description":"Neubau-Projekte (Projekte zuerst)","code":31},{"description":"Aktualität (neueste zuerst)","code":2}],"pagerTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=%page%","sortingTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&sorting=%sorting%","world":"LIVING","international":false,"device":{"deviceType":"NORMAL","devicePlatform":"UNKNOWN","tablet":false,"mobile":false,"normal":true}

EDIT:

This is what the script looks like when built on the requests module:

import requests

link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen'

headers = {
    'accept': 'application/json; charset=utf-8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'x-requested-with': 'XMLHttpRequest',
    'content-type': 'application/json; charset=utf-8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=1',
    # 'cookie': 'hardcoded cookies'
}

params = {
    'price': '1000.0-',
    'constructionyear': '-2000',
    'pagenumber': '2'
}

sess = requests.Session()
sess.headers.update(headers)
resp = sess.get(link, params=params)
print(resp.json())

Scrapy's CookiesMiddleware ignores a 'cookie' entry passed in headers.
Reference: scrapy/scrapy#1992

Pass cookies explicitly:

# Requires `import http.cookies` at the top of the module.
yield scrapy.Request(
    url=req_url,
    headers=self.headers,
    callback=self.parse,
    # Add the following line:
    cookies={k: v.value for k, v in http.cookies.SimpleCookie(self.headers.get('cookie', '')).items()},
)
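To see what the dict comprehension passed to cookies= does on its own, here is a minimal standalone sketch using only the standard library. The helper name parse_cookie_header and the sample cookie values are hypothetical, chosen only for illustration; a real header string would be copied from the browser.

```python
from http.cookies import SimpleCookie
from typing import Dict


def parse_cookie_header(raw: str) -> Dict[str, str]:
    """Turn a raw 'Cookie' header string into a plain name -> value dict."""
    jar = SimpleCookie()
    jar.load(raw)  # parses "name=value; other=value" pairs
    # Each entry is a Morsel object; .value holds the cookie's value.
    return {name: morsel.value for name, morsel in jar.items()}


if __name__ == "__main__":
    # Placeholder values, not real cookies from the site.
    print(parse_cookie_header("reese84=abc123; foo=bar"))
```

The resulting dict has the shape that scrapy.Request accepts for its cookies argument, so the header string only has to be maintained in one place.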

Note: That site uses GeeTest CAPTCHA, which cannot be solved by simply rendering the page or using Selenium, so you still need to periodically refresh the hardcoded cookie (cookie name: reese84) taken from the browser, or use a service like 2Captcha.
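Because the cookie can expire, it may help to check the response body before trusting it. The sketch below assumes (this is not confirmed by the site) that an expired cookie yields an HTML captcha page rather than JSON, which would explain the JSONDecodeError. The helper name load_json_object is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def load_json_object(body: str) -> Optional[Dict[str, Any]]:
    """Return the parsed JSON object, or None if the body is not a JSON object."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        # Typically an HTML page (for example a captcha challenge), not JSON.
        return None
    # The expected payload is an object with a "searchResponseModel" key,
    # so anything that is not a dict is treated as a failure as well.
    return data if isinstance(data, dict) else None


if __name__ == "__main__":
    print(load_json_object('{"searchResponseModel": {}}'))
    print(load_json_object("<html>captcha</html>"))
```

In the spider, parse could call load_json_object(response.text) and log a warning when it returns None, which is the signal that the cookie needs to be refreshed.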

