Scraping the Booking website with Scrapy: the CSV file is empty

I'm trying to scrape the names of the hotels shown on the first page of the Booking website using the Scrapy library in Python, but I get an empty CSV file: it contains only the column names. Any suggestions? Thank you.

This is the Python code:

import scrapy
import logging
from scrapy.crawler import CrawlerProcess
from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):
    def __init__(self):
        self.file = open ('duproprio.tmp','wb')
        self.exporter = CsvItemExporter(self.file,str)
        self.exporter.start_exporting()
    def close_spider(self,spider):
        self.exporter.finish_exporting()
        self.file.close()
    def process_items(self,item,spider):
        self.exporter.export_item(item)
        return item

class DuProprioSpider(scrapy.Spider):
    name = "booking"
    start_urls = [
        "https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCAEoggI46AdIM1gEaIwBiAEBmAENuAEXyAEP2AED6AEBiAIBqAIDuALsycKNBsACAdICJGE1YmJmNDE1LWU2ZTEtNGEzMy05MTcyLThkYmQ2OGI5NWE5OdgCBOACAQ&sid=2e5b4623e13363b5ec7de2d7957c8c22&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.fr.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaIwBiAEBmAENuAEXyAEP2AED6AEBiAIBqAIDuALsycKNBsACAdICJGE1YmJmNDE1LWU2ZTEtNGEzMy05MTcyLThkYmQ2OGI5NWE5OdgCBOACAQ%3Bsid%3D2e5b4623e13363b5ec7de2d7957c8c22%3B&ss=Maroc&is_ski_area=&checkin_year=&checkin_month=&checkout_year=&checkout_month=&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&ss_raw=ma&ac_position=1&ac_langcode=fr&ac_click_type=b&dest_id=143&dest_type=country&place_id_lat=32.4281&place_id_lon=-6.92197&search_pageview_id=7ca057bb44b9012d&search_selected=true&search_pageview_id=7ca057bb44b9012d&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0"]
    
    custom_settings = {
        'LOG_LEVEL':logging.WARNING,
        'ITEM_PIPELINES':{'__main__.CsvPipeline':1},
        'FEED_FORMAT':'csv',
        'FEED_URI':'bookingresult.csv'
        }
    
    #count = 0
    #total = 25
    
    def parse(self,response):
        #self.count =+25
        nexturl = "https://www.booking.com/searchresults.fr.html?label=gog235jc-1DCAIojAFCAm1hSA1YA2iMAYgBAZgBDbgBF8gBD9gBA-gBAfgBAogCAagCA7gCj9q5jQbAAgHSAiQ1MDlhN2M0Ny0yMmYwLTRiNDUtYjNhMC0xY2Y1MTg3NWM5ODfYAgTgAgE&sid=2e5b4623e13363b5ec7de2d7957c8c22&aid=356980&dest_id=-38833&dest_type=city&srpvid=00bd4bf5ca01008f&track_hp_back_button=1&nflt=ht_id%3D204&offset=0"
        for i in response.css('div._814193827>div>div>div>div>a'):
            yield scrapy.Request(url=i.xpath('@href').extract_first(),callback = self.parse_detail) 
        #if self.count < self.total+25:
        yield scrapy.Request(nexturl,self.parse)
    
    
    def parse_detail(self,response):
        nom_hotel = response.css('h2#hp_hotel_name.hp__hotel-name::text').get()
        
        yield{
            'nom_hotel' : nom_hotel.strip()
            }
        
process = CrawlerProcess(
    {
     'USER_AGENT':'Mozilla/4.0 (comatible;MSIE 7.0;Window NT 5.1)'
     })
process.crawl(DuProprioSpider)
process.start()

1. The first text node the selector matches is '\n', so get() returns only that and strip() leaves an empty string. An example solution using getall():

    def parse_detail(self, response):
        # Collect every text node of the heading, not just the first one
        nom_hotel = response.css('h2#hp_hotel_name.hp__hotel-name::text').getall()
        nom_hotel = ''.join(nom_hotel)
        yield {
            'nom_hotel': nom_hotel.strip()
        }

Output:

nom_hotel
Camp Sahara Holidays
Lovely House at La perle de Cabo Negro
Riad Dar Salam
Hôtel Auberge du Littoral
Kasbah Sirocco
...
...
...
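To see why get() produced an empty name, here is a minimal sketch that uses a plain Python list in place of the selector results. The text-node values are assumed for illustration; they were not captured from the actual page.

```python
# Hypothetical text nodes of the <h2>: a leading newline, then the hotel name.
text_nodes = ['\n', 'Riad Dar Salam\n']

# get() returns only the first match, so stripping it leaves nothing.
first_only = text_nodes[0].strip()

# getall() returns every match; joining them recovers the name.
joined = ''.join(text_nodes).strip()

print(repr(first_only))  # ''
print(repr(joined))      # 'Riad Dar Salam'
```

Joining before stripping means the result does not depend on which text node happens to hold the name.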

2. Your pipeline is wrong, so the results end up at the end of the file after many blank lines. Alternatively, just use the default exporter:

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'bookingresult.csv'
    }

3. You don't have to visit each hotel page just to get its name; you can scrape the names directly from the search results page. Example:

    def parse(self, response):
        nexturl = "https://www.booking.com/searchresults.fr.html?label=gog235jc-1DCAIojAFCAm1hSA1YA2iMAYgBAZgBDbgBF8gBD9gBA-gBAfgBAogCAagCA7gCj9q5jQbAAgHSAiQ1MDlhN2M0Ny0yMmYwLTRiNDUtYjNhMC0xY2Y1MTg3NWM5ODfYAgTgAgE&sid=2e5b4623e13363b5ec7de2d7957c8c22&aid=356980&dest_id=-38833&dest_type=city&srpvid=00bd4bf5ca01008f&track_hp_back_button=1&nflt=ht_id%3D204&offset=0"
        all_names = response.xpath('//div[@data-testid="title"]/text()').getall()
        for name in all_names:
            yield {'nom_hotel': name}

This is faster because you make 1 request instead of 26 (the first request plus one for each of the 25 search results).
