
Python Scrapy - "Scraped from" URL is not the one set in start_urls

I'm new to Scrapy and I have a question about the URLs that are scraped.

I'm trying to scrape a site where every page you visit redirects to the homepage; you can only reach other pages by clicking a banner. I tried using

meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]}

to avoid the redirect, but the "Scraped from" URL was still wrong. So I thought the problem was the cookies. To test that, I hard-coded the cookies to match what the browser has when entering the site. Now it isn't redirecting, and I don't even need to put 'dont_redirect' in the meta, but when I look at the debug output it is still scraping the homepage.

For now the code looks like this:


import scrapy


class MatchOpeningSpider(scrapy.Spider):
    name = 'bet_365_match_opening'
    start_urls = [
        'https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/'
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, cookies={
                'pstk': '04761A56B7A54D9BB3948A093FB9F440000003',
                'rmbs': 3,
                'aps03': 'lng=22&tzi=34&oty=2&ct=28&cg=1&cst=0&hd=N&cf=N',
                'session': 'processform=0&fms=1'
            })

    def parse(self, response):
        games = response.css('div.sl-CouponParticipantWithBookCloses_Name').extract()
        yield {'games': games}

In the debug output you can see that the Crawled URL is right, but the Scraped from URL is the homepage:

2019-04-21 12:02:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/> (referer: None)
2019-04-21 12:02:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bet365.com/>

What am I doing wrong? Thanks for helping!

Your start_url has a fragment identifier (the hash sign, #) in the middle. Everything after it is handled only on the client side by the browser; it is not sent to the server as part of the HTTP request.

This means the data you need might not be in the HTTP response for the start_url at all. It may instead come from other Ajax calls made after the main document is requested, and be rendered on the client side.
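As a rough illustration of the point above, the standard library's urllib.parse.urldefrag can split the start_url from the question into the part that goes into the HTTP request and the fragment that stays on the client. This is only a sketch of the URL splitting, not of what Scrapy does internally; the variable names are made up for the example.

```python
from urllib.parse import urldefrag

# The start_url from the question.
start_url = 'https://www.bet365.com/#/AC/B1/C1/D13/E38078994/F2/'

# urldefrag separates the URL proper from the fragment after '#'.
requested_url, fragment = urldefrag(start_url)

print(requested_url)  # https://www.bet365.com/
print(fragment)       # /AC/B1/C1/D13/E38078994/F2/
```

The first value matches the "Scraped from" URL in the log, which is consistent with the server only ever seeing a request for the homepage.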

My suggestions: 我的建议:

  1. Use the browser's dev tools, the Scrapy shell, or even curl to first make sure the content you need actually exists in the HTTP response for the start_url. Otherwise you're scraping the wrong URL.

  2. Make the HTTP headers and cookies exactly match what a real browser sends. Scrapy handles 3xx redirects and cookie changes for you, but you'll need to find and reproduce the actual navigation path in your spider.

  3. If the data is rendered client-side and you're tired of dealing with this, try a Selenium-based spider, which drives a browser with a JS engine, to get around these problems.
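For suggestion 1, one possible way to do the check offline is to save the raw response body (for example from curl or the Scrapy shell) and test whether any element carries the class the spider selects on. The sketch below uses only the standard library; ClassFinder and html_has_class are hypothetical helper names, and the two HTML strings are invented samples, not real responses from the site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get('class') or '').split()
        if self.wanted_class in classes:
            self.found = True


def html_has_class(html, wanted_class):
    """Return True if the raw HTML contains an element with wanted_class."""
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    return finder.found


# Invented sample bodies: one with the data in the markup, one with only
# an empty container that client-side JavaScript would fill in later.
wanted = 'sl-CouponParticipantWithBookCloses_Name'
server_rendered = '<div class="sl-CouponParticipantWithBookCloses_Name">Team A</div>'
client_rendered = '<div id="app"></div><script src="app.js"></script>'

print(html_has_class(server_rendered, wanted))  # True
print(html_has_class(client_rendered, wanted))  # False
```

If the check comes back False for the real response body, the selector in parse() has nothing to match, and the data most likely arrives through a later request or client-side rendering.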
