Trouble collecting different property ids from a webpage using the requests module

After clicking the "11.331 Treffer" button at the top right corner of the filter on this webpage, I can see the results displayed on that page. I've created a script using the requests module to fetch the ID numbers of the different properties from that page.

However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools and paste them into the headers, I get the expected results.

I don't wish to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them within headers to get the desired result, but I still get the same error.

I'm trying like:

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'
        
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
    'accept': 'application/json; charset=utf-8',
    'x-requested-with': 'XMLHttpRequest'
}

def get_cookies():
    with webdriver.Chrome() as driver:
        driver.get(start_url)
        time.sleep(10)
        cookiejar = {c['name']:c['value'] for c in driver.get_cookies()}
        return cookiejar


cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item,val in cookies.items()])

with requests.Session() as s:
    s.headers.update(headers)
    s.headers['cookie'] = cookie_string
    res = s.get(link)
    container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    for item in container:
        try:
            project_id = item['@id']
        except KeyError:
            project_id = ""
        print(project_id)

How can I scrape property ids from that webpage using the requests module?

EDIT:

The following portion of the cookies is crucial; without it, the script most likely fails with the error I mentioned. However, Selenium failed to include it among the cookies it collected.

reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;

I think another part of your problem is that the link does not return JSON. It returns an HTML document. Part of that document does contain JavaScript that sets a JS variable to a JSON object, but you can't get at that with res.json().
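To illustrate the point, here is a minimal sketch of pulling a JSON object out of a script block in an HTML string. The helper name extract_js_object and the sample HTML are made up for illustration; the real page's markup may differ, and a JavaScript object literal is not always valid JSON, so treat this as a starting point only.

```python
import json
import re


def extract_js_object(html, var_name):
    """Return the JSON object assigned to `var_name` in `html`, or None."""
    # Find "<var_name> = " and start decoding right after it.
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        obj, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


# Hypothetical stand-in for the HTML the server returns.
sample_html = """
<html><body><script>
  IS24.resultList = {"resultlistEntries": [{"@id": "101"}, {"@id": "102"}]};
</script></body></html>
"""

data = extract_js_object(sample_html, "IS24.resultList")
ids = [entry["@id"] for entry in data["resultlistEntries"]]
print(ids)
```

This only helps if the HTML you receive actually contains the script block, which it will not if the request was blocked.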

In theory, you could use selenium to go to the link and grab the contents of the IS24.resultList variable by executing javascript like this:

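driver.get(link)
time.sleep(10)  # crude wait for the page's scripts to run
# execute_script already converts a JavaScript object into a Python dict,
# so json.loads() is not needed (it would raise TypeError on a dict).
result_list = driver.execute_script("return window.IS24.resultList")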

In practice, I think they're really serious about blocking bots, and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium, I don't even get the reCAPTCHA option that I get when visiting through a regular browser session in incognito mode.
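If you want to confirm what the server actually sent back before decoding it, a small check like the one below can help. The function name decode_json_response is hypothetical, and it assumes the server labels JSON responses with a Content-Type containing "json".

```python
import json


def decode_json_response(content_type, body):
    """Parse `body` as JSON, or raise ValueError showing what came back instead."""
    if "json" not in content_type.lower():
        # Show the start of the body so a block page or HTML document is easy to spot.
        raise ValueError(f"expected JSON, got {content_type!r}: {body[:80]!r}")
    return json.loads(body)
```

With requests you would call it as decode_json_response(res.headers.get("Content-Type", ""), res.text) in place of res.json().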
