Python Scrapy - "Scraped from" URLs are not the ones set in start_urls

I'm new to Scrapy and I have a question about which URLs actually get scraped.

I'm trying to scrape a site where every page you visit redirects to the homepage; only after clicking a banner can you access the other pages. I've tried to use

meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]}

to avoid the redirect, but the "Scraped from" URL was still wrong. So I thought the problem was the cookies. To test that, I hard-coded the cookies to match the ones the browser has when it enters the site. Now it isn't redirecting, and I don't even need to put 'dont_redirect' in the meta, but when I look at the debug log it is still scraping the homepage.

For now the code looks like this:


import scrapy


class MatchOpeningSpider(scrapy.Spider):
    name = 'bet_365_match_opening'
    start_urls = [
        'https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/'
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, cookies={
                'pstk': '04761A56B7A54D9BB3948A093FB9F440000003',
                'rmbs': 3,
                'aps03': 'lng=22&tzi=34&oty=2&ct=28&cg=1&cst=0&hd=N&cf=N',
                'session': 'processform=0&fms=1'
            })

    def parse(self, response):
        games = response.css('div.sl-CouponParticipantWithBookCloses_Name').extract()
        yield {'games': games}

In the debug log you can see that the Crawled URL is right, but the "Scraped from" URL is the homepage:

2019-04-21 12:02:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/> (referer: None)
2019-04-21 12:02:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bet365.com/>

What am I doing wrong? Thanks for helping!

Your start_url contains a fragment identifier (the hash sign, #) in the middle. Everything after it is never sent to the server; the browser handles it only on the client side.

This means the data you need might not be in the HTTP response for the start_url at all. It may come from other Ajax calls made after the main document is requested, and then be rendered on the client side.
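To see why the log shows the homepage, here is a minimal sketch using only the standard library (not Scrapy itself) that splits the question's start URL at the fragment. The variable names are just for illustration.

```python
from urllib.parse import urldefrag

start_url = 'https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/'

# urldefrag separates the part that goes over HTTP from the fragment,
# which stays on the client and is never sent to the server.
requested_url, fragment = urldefrag(start_url)

print(requested_url)  # https://www.bet365.com/
print(fragment)       # /AC/B1/C1/D13/E38078994/F2/
```

The first value is the same URL that appears in the "Scraped from" log line, so the server only ever sees a request for the homepage.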

My suggestions:

  1. Use the browser's dev tools, Scrapy shell, or even a tool such as curl to check first that the content you need actually exists in the HTTP response for the start_url. If it doesn't, you're scraping the wrong URL.

  2. Make the HTTP headers and cookies exactly match what a real browser sends. Scrapy handles 3xx redirects and cookie changes for you, but you'll need to find the actual sequence of requests the browser makes and reproduce it in your spider.

  3. If the data is rendered on the client side and you're tired of working around this, try a Selenium-based spider, which uses a browser with a JS engine to get past these problems.
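As a rough sketch of suggestion 1, the check can also be scripted with the standard library alone. The helper names fetch_html and has_marker are hypothetical, the User-Agent value is an assumption, and the marker is simply the CSS class from the spider in the question. The site may still respond differently to a script than to a browser, so treat this as a first sanity check only.

```python
from urllib.request import Request, urlopen


def fetch_html(url, user_agent='Mozilla/5.0'):
    """Download the raw document, before any JavaScript runs (hypothetical helper)."""
    req = Request(url, headers={'User-Agent': user_agent})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode('utf-8', errors='replace')


def has_marker(html, marker='sl-CouponParticipantWithBookCloses_Name'):
    """Return True if the class name the spider selects on appears in the raw HTML."""
    return marker in html


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://www.bet365.com/')
# print(has_marker(html))
```

If has_marker returns False for the raw response, the elements are most likely built by JavaScript after the page loads, and the CSS selector in parse will come back empty no matter which cookies are set.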
