
URL gets redirected in Scrapy

I am new to Scrapy and, just as an experiment, I am trying to scrape hotel names from booking.com. The response URL is different from the request URL. I also want to get hotel names from all pages.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class BookingSpider(CrawlSpider):
        name = 'booking.com_mumbai'
        allowed_domains = ['booking.com']
        start_urls = [
            'https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYAS7CAQN4MTHIAQzYAQPoAQGSAgF5qAID&sid=73f533eb666233525bc516654c914549&checkin_month=4&checkin_monthday=26&checkin_year=2017&checkout_month=4&checkout_monthday=30&checkout_year=2017&class_interval=1&dest_id=20014181&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&mih=0&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA&ss_all=0&ss_raw=Los%20Angeles%2C%20CA%3B&ssb=empty&sshis=0&rows=40&offset=40'
        ]

        # Follow links whose URL matches the pattern.
        rules = (
            Rule(LinkExtractor(allow=r'rows?.\w+', unique=True), follow=True, callback="parse"),
        )

        def parse_hotel_item(self):
            pass

        def parse(self, response):
            # One <div> per hotel in the results list.
            hotels = response.xpath('//*[@id="hotellist_inner"]/div')
            for hotel in hotels:
                print(hotel.xpath("//h3/a/span[2]/text()").extract())
            print("Done")

Logs:

2017-04-15 23:40:48 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrap_booking_dot_com)
2017-04-15 23:40:48 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrap_booking_dot_com.spiders', 'SPIDER_MODULES': ['scrap_booking_dot_com.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrap_booking_dot_com'}
2017-04-15 23:40:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-04-15 23:40:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-15 23:40:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-15 23:40:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-15 23:40:48 [scrapy.core.engine] INFO: Spider opened
2017-04-15 23:40:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-15 23:40:48 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-15 23:40:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.booking.com/robots.txt> (referer: None)
2017-04-15 23:40:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYAS7CAQN4MTHIAQzYAQPoAQGSAgF5qAID&sid=73f533eb666233525bc516654c914549&checkin_month=4&checkin_monthday=26&checkin_year=2017&checkout_month=4&checkout_monthday=30&checkout_year=2017&class_interval=1&dest_id=20014181&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&mih=0&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA&ss_all=0&ss_raw=Los%20Angeles%2C%20CA;&ssb=empty&sshis=0&rows=40&offset=40> from <GET https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYAS7CAQN4MTHIAQzYAQPoAQGSAgF5qAID&sid=73f533eb666233525bc516654c914549&checkin_month=4&checkin_monthday=26&checkin_year=2017&checkout_month=4&checkout_monthday=30&checkout_year=2017&class_interval=1&dest_id=20014181&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&mih=0&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA&ss_all=0&ss_raw=Los%20Angeles%2C%20CA%3B&ssb=empty&sshis=0&rows=40&offset=40>
2017-04-15 23:40:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.booking.com/searchresults.en-gb.html?dest_id=20014181;dest_type=city;offset=30;ss=Los%2520Angeles%252C%2520California%252C%2520USA%252C%2520North%2520America%252C%2520CA> from <GET https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYAS7CAQN4MTHIAQzYAQPoAQGSAgF5qAID&sid=73f533eb666233525bc516654c914549&checkin_month=4&checkin_monthday=26&checkin_year=2017&checkout_month=4&checkout_monthday=30&checkout_year=2017&class_interval=1&dest_id=20014181&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&mih=0&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA&ss_all=0&ss_raw=Los%20Angeles%2C%20CA;&ssb=empty&sshis=0&rows=40&offset=40>
2017-04-15 23:40:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.booking.com/searchresults.en-gb.html?dest_id=20014181;dest_type=city;offset=30;ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA> from <GET https://www.booking.com/searchresults.en-gb.html?dest_id=20014181;dest_type=city;offset=30;ss=Los%2520Angeles%252C%2520California%252C%2520USA%252C%2520North%2520America%252C%2520CA>
2017-04-15 23:40:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.booking.com/searchresults.en-gb.html?dest_id=20014181;dest_type=city;offset=30;ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA> (referer: None)

I am getting different results.

According to your logs and source code, your scraper starts at

https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaGyIAQGYAS7CAQN4MTHIAQzYAQPoAQGSAgF5qAID&sid=73f533eb666233525bc516654c914549&checkin_month=4&checkin_monthday=26&checkin_year=2017&checkout_month=4&checkout_monthday=30&checkout_year=2017&class_interval=1&dest_id=20014181&dest_type=city&dtdisc=0&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&mih=0&no_rooms=1&postcard=0&raw_dest_type=city&room1=A%2CA&sb_price_type=total&search_selected=1&src=index&src_elem=sb&ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA&ss_all=0&ss_raw=Los%20Angeles%2C%20CA%3B&ssb=empty&sshis=0&rows=40&offset=40

and is then redirected multiple times, so it ends up scraping

https://www.booking.com/searchresults.en-gb.html?dest_id=20014181;dest_type=city;offset=30;ss=Los%20Angeles%2C%20California%2C%20USA%2C%20North%20America%2C%20CA

And there is nothing wrong with this.

Why?

That is simply how booking.com handles your start URL. Nothing special is happening to your scraper; the same thing happens when you open the start URL in a browser.

Just paste your start URL into a browser and you will see the URL in the address bar change to an almost identical long URL.

So the response URL is fine.
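To see how a final URL relates to the requested one, it can help to compare their query parameters. The following is only a minimal sketch using the standard library: the two URLs are shortened versions of the ones in the logs (most parameters omitted), and `query_params` and `diff_params` are hypothetical helper names, not part of Scrapy.

```python
import re
from urllib.parse import unquote, urlsplit


def query_params(url):
    """Return the query parameters of `url` as a dict.

    Accepts both '&' and ';' as separators, since the redirected
    URL in the logs uses ';'.
    """
    params = {}
    for pair in re.split(r"[&;]", urlsplit(url).query):
        if pair:
            key, _, value = pair.partition("=")
            params[key] = unquote(value)
    return params


def diff_params(requested, final):
    """Return (dropped keys, {key: (old, new)}) between two URLs."""
    before, after = query_params(requested), query_params(final)
    dropped = sorted(set(before) - set(after))
    changed = {
        key: (before[key], after[key])
        for key in before
        if key in after and before[key] != after[key]
    }
    return dropped, changed


# Shortened versions of the URLs from the logs (assumption: only a few
# of the original parameters are kept for readability).
requested = ("https://www.booking.com/searchresults.en-gb.html"
             "?dest_id=20014181&dest_type=city&rows=40&offset=40")
final = ("https://www.booking.com/searchresults.en-gb.html"
         "?dest_id=20014181;dest_type=city;offset=30")

dropped, changed = diff_params(requested, final)
print("dropped:", dropped)
print("changed:", changed)
```

Running the same comparison on the full URLs from the logs lists every parameter that was dropped or rewritten along the redirect chain.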

The real problem comes from the XPath "//h3/a/span[2]/text()" in the for loop.

There are two problems with it:

  1. Because the XPath starts with '/', it is evaluated against the document root, while you most probably want it evaluated against each of the hotel elements you are iterating over.

The fix is to add a '.' to the front of the XPath (similar to using a '.' when navigating folders on a file system).

Example: ".//h3/a/span[2]/text()"

  2. Even with that fix, I couldn't make your XPath work. I crafted the following XPath instead, which works fine for me: ".//span[@class='sr-hotel__name']/text()"

(A good XPath should preferably rely on ids and class names rather than on a particular nesting of elements, as the former are less likely to change.)
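The difference between searching from the document root and searching relative to the current element can be illustrated with a small sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the markup is a hypothetical, heavily simplified version of the results list. ElementTree does not accept a leading `//` on an element, so the "absolute" case is simulated by querying `root` on every iteration, which is the effect a leading `//` has in Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search results markup.
HTML = """
<div id="hotellist_inner">
  <div><h3><a><span class="sr-hotel__name">Hotel A</span></a></h3></div>
  <div><h3><a><span class="sr-hotel__name">Hotel B</span></a></h3></div>
</div>
""".strip()

root = ET.fromstring(HTML)
hotels = root.findall("./div")  # one <div> per hotel

NAME_PATH = ".//span[@class='sr-hotel__name']"

# Querying the document root inside the loop (the effect of a leading
# '//' in Scrapy): every iteration sees all hotel names.
from_root = [[span.text for span in root.findall(NAME_PATH)] for _ in hotels]

# Querying relative to the current hotel element: each iteration sees
# only that hotel's own name.
relative = [[span.text for span in hotel.findall(NAME_PATH)] for hotel in hotels]

print(from_root)  # [['Hotel A', 'Hotel B'], ['Hotel A', 'Hotel B']]
print(relative)   # [['Hotel A'], ['Hotel B']]
```

In the spider itself, the relative form corresponds to calling `hotel.xpath(".//span[@class='sr-hotel__name']/text()").extract()` inside the loop.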
