
Callback function not working properly in Scrapy

Hi, I am new to Scrapy and am trying to scrape an ASP.NET site. I have identified the parameters that are sent when the form is posted and have used them in my code. However, although data is scraped from the first page, nothing is scraped from the following pages, even though the spider reports that they were crawled successfully. I am stuck trying to figure out why it's not working. 'clean_parsed_string' and 'get_parsed_string' are my own functions for extracting string elements and have been tested on other websites.

def parse(self, response):
    sel = Selector(response)
    snodes = sel.xpath('//div[@id="hotel_result_hotel_item"]')

    for snode_restaurant in snodes:
        hotel_item = Hotel_Items()
        hotel_item['name'] = clean_parsed_string(get_parsed_string(snode_restaurant, 'div[@class=""]/table[@class="widthfull"]//a[@class="hot_name"]/text()'))
        hotel_item['address'] = clean_parsed_string(get_parsed_string(snode_restaurant, 'div[@class=""]/table[@class="widthfull"]//span[@class="fontsmalli"]/text()'))
        hotel_item['stars'] = clean_parsed_string(get_parsed_string(snode_restaurant, 'div[@class=""]/table[@class="widthfull"]//div[@class="mbluebold col_hotelinfo_name"]/input/@class'))
        hotel_item['room1'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[1]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room1_price_USD'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[1]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room2'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[2]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room2_price_USD'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[2]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room3'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[3]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room3_price_USD'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[3]/td[5]/p[@class="ratepernight"]/span/text()'))
        hotel_item['room4'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[4]/td[1]/p[@class="roomtype"]/span/text()'))
        hotel_item['room4_price_USD'] = clean_parsed_string(get_parsed_string(snode_restaurant,'div[@class=""]/div[@class="showroom_rates"]/table[@class="widthfull text_left"]/tr[4]/td[5]/p[@class="ratepernight"]/span/text()'))
        yield hotel_item


    viewstate = sel.xpath('//input[@name="__VIEWSTATE"]/@value').extract()[0]
    yield FormRequest.from_response(response,formdata={'ctl00$scriptmanager1':'ctl00$ContentMain$upResultFooter|ctl00$ContentMain$lbtnFooterNext',
                'ctl00_scriptmanager1_HiddenField':'',
                '__EVENTTARGET':'ctl00$ContentMain$lbtnFooterNext',
                '__EVENTARGUMENT':'',
                '__LASTFOCUS':'',
                '__VIEWSTATE': viewstate,
                '__SCROLLPOSITIONX':'0',
                '__SCROLLPOSITIONY':'0',
                'ctl00$Googlesearch$txtSearch':'',
                'ctl00$ddlCurrency$hidCurrencyChange':'USD',
                'ctl00$ContentMain$hdfMinPrice':'',
                'ctl00$ContentMain$hdfMaxPrice':'',
                'ctl00$ContentMain$ddlSort':'1',    
                'ctl00$ContentMain$hidMenu':'0',
                'ctl00$ContentMain$hidSubMenu':'',
                'ctl00$ContentMain$DestinationSearchBox1$arrivaldate':'06/23/2014',
                'ctl00$ContentMain$DestinationSearchBox1$departdate':'06/25/2014',
                'ctl00$ContentMain$DestinationSearchBox1$controlmode':'1',
                'ctl00$ContentMain$DestinationSearchBox1$jsRooms':'0',  
                'ctl00$ContentMain$DestinationSearchBox1$jsAdults':'0',
                'ctl00$ContentMain$DestinationSearchBox1$jsChildren':'0',
                'ctl00$ContentMain$DestinationSearchBox1$SearchHotel':'no',
                'ctl00$ContentMain$DestinationSearchBox1$ErrorCharLengthMessage':'Please enter at least the first two letters of the name you are looking for.',
                'ctl00$ContentMain$DestinationSearchBox1$TextError':'Please enter the name of a Country, City, Airport, Area, Landmark or Hotel to proceed.',
                'ctl00$ContentMain$DestinationSearchBox1$TextSearch1$tmptextDefault':'Country, City, Airport, Area, Landmark',
                'ctl00$ContentMain$DestinationSearchBox1$TextSearch1$txtSearch':'Colombo',
                'ctl00$ContentMain$DestinationSearchBox1$ddlDistance':'1',
                'ddlCheckInDay':'23',
                'ddlCheckInMonthYear':'6,2014',
                'datepickerarrival':'',
                'ddlCheckOutDay':'25',
                'ddlCheckOutMonthYear':'6,2014',
                'ctl00$ContentMain$DestinationSearchBox1$ddlNights':'2',
                'datepickerdepart':'',
                'ctl00$ContentMain$DestinationSearchBox1$ddlRoom':'1',
                'ctl00$ContentMain$DestinationSearchBox1$ddlAdult':'2',
                'ctl00$ContentMain$DestinationSearchBox1$ddlChildren':'0',
                'ctl00$ContentMain$txtHotelName':'',
                'ctl00$ContentMain$hidHotelList2603':'',
                'ctl00$ContentMain$HotelFilterStarRating$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterFacilities$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterAccommodationType$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterArea$HiddenFilterStatus':'',
                'ctl00$ContentMain$HotelFilterChainAndBrand$HiddenFilterStatus':'',
                #'__ASYNCPOST':'true'
                },
            callback=self.parse,clickdata=None)

A site may return a 200 OK status even when the POST data you sent is wrong. Try using the Scrapy shell to submit a FormRequest with the formdata you built and inspect what the site actually returns.
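One way to tell whether the postback was actually honoured is to compare what was extracted from consecutive responses: if the "next page" response yields the same hotel names as the previous one, the server most likely ignored the POST and re-rendered the same page. Below is a minimal sketch of that check in plain Python. The helper names page_fingerprint and is_repeated_page are hypothetical, and the lists of names stand in for whatever the spider extracts from each page.

```python
import hashlib


def page_fingerprint(names):
    """Return a stable digest for the item names found on one page.

    Hypothetical helper: `names` is a list of strings, for example the
    hotel names extracted from a single response.
    """
    joined = "\n".join(name.strip() for name in names)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def is_repeated_page(previous_names, current_names):
    """True when two responses carry the same items in the same order."""
    return page_fingerprint(previous_names) == page_fingerprint(current_names)
```

In a spider, the previous page's fingerprint could be carried along in the request's meta dict and compared in the callback, so that a repeated page is logged instead of being silently scraped again.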

I suggest using something similar to this, so that you do not have to type every form field by hand and risk making mistakes:

formdata = {}

for hid in sel.xpath('//input[@type="hidden" and @value and @name]'):
    formdata[hid.xpath('@name').extract()[0]] = hid.xpath('@value').extract()[0]
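The same idea can be tried outside a spider. The sketch below uses only the standard library's html.parser instead of Scrapy selectors, so it can be run against a saved copy of the page. HiddenInputCollector and build_postback_formdata are hypothetical names, and setting __EVENTTARGET and __EVENTARGUMENT reflects the usual ASP.NET postback fields seen in the question, not anything confirmed for this particular site.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        value = attributes.get("value")
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        # Mirror the XPath above: keep only inputs that have both a name
        # and a value attribute (an empty value still counts).
        if is_hidden and name and value is not None:
            self.fields[name] = value


def build_postback_formdata(html, event_target):
    """Build a formdata dict from the page's hidden inputs.

    `event_target` is the control that triggers the postback, for example
    the "next page" link button (an assumption taken from the question).
    """
    collector = HiddenInputCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    formdata["__EVENTTARGET"] = event_target
    formdata.setdefault("__EVENTARGUMENT", "")
    return formdata
```

The resulting dict could then be passed as formdata to the FormRequest. Note that FormRequest.from_response already pre-populates fields from the form in the response, so in many cases only the fields that differ need to be supplied. When the postback is triggered by JavaScript, passing dont_click=True may also be worth trying, since by default from_response simulates a click on the first clickable element.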
