Scrapy spider does not follow a Request callback when using yield

Question

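I'm a beginner, and I can't get my spider to enter parse_votes in the code below, even though I set it as the callback. The other parse methods work fine, I get no errors, and I checked the 'link' variable, which holds the correct information. Help?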

Edit: full code

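import json

import scrapy
from scrapy.selector import Selector

# DeputiesInfo and VotesInfo are the project's Item classes;
# their import is not shown in the original post.


class DeputadosSpider(scrapy.Spider):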
    name = "deputies"

    allowed_domains = ["camara.leg.br"]
    start_urls = ["http://www2.camara.leg.br/deputados/pesquisa"]

    def parse(self, response):
        sel = Selector(response)
        sel_options = sel.xpath('//*[@id="deputado"]/option[position()>1]')
        iteration = 1
        # get deputies pages
        for sel_option in sel_options:
            item = DeputiesInfo()
            item["war_name"] = sel_option.xpath("text()").extract()
            item["link_id"] = sel_option.extract().partition('?')[-1].rpartition('"')[0]
            item["page_link"] = 'http://www.camara.leg.br/internet/Deputado/dep_Detalhe.asp?id=' + item["link_id"]
            item["id"] = iteration
            iteration += 1
            # go scrape their page
            yield scrapy.Request(item["page_link"], callback=self.parse_deputy, meta={'item': item})

    def parse_deputy(self, response):
        item = response.meta['item']
        sel = Selector(response)
        info = sel.xpath('//div[@id="content"]/div/div[1]/ul/li')
        # finish filling in the data
        item["full_name"] = info.xpath("text()").extract_first()
        item["party"] = info.xpath("text()").extract()[2].partition('/')[0]
        item["uf"] = info.xpath("text()").extract()[2].partition('/')[-1].rpartition('/')[0]
        item["legislatures"] = info.xpath("text()").extract()[5]
        item["picture"] = sel.xpath('//div[@id="content"]/div/div[1]//img[1]/@src').extract()
        # append the item to a JSON file; the with-block closes the file
        with open('deputies_info.json', 'a') as f:
            f.write(json.dumps(dict(item)) + ",\n")
        # collect votes info
        get_years = sel.xpath('//*[@id="my-informations"]/div[3]/div/ul/li[1]/a[position()<4]')
        for get_year in get_years:
            vote = VotesInfo()
            vote["deputy_id"] = item["id"]
            vote["year"] = get_year.xpath("text()").extract_first()
            link = get_year.xpath("@href").extract_first()
            print(vote["year"])
            print(link)
            # go to voting pages
            yield scrapy.Request(link, callback=self.parse_votes, meta={'vote': vote})

    def parse_votes(self, response):
        #vote = response.meta['vote']
        print('YYYYYYYYYYYYYUHUL IM IN!!')

Answer 1

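Your problem is allowed_domains. The links you request from parse_deputy look like http://www.camara.gov.br/internet/deputado/RelVotacoes.asp?... and their domain is camara.gov.br, which is not in the list, so Scrapy filters those requests out as offsite. Add that domain to allowed_domains: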

allowed_domains = ["camara.leg.br", "camara.gov.br"]

PS: I ran your code with allowed_domains commented out, and parse_votes works perfectly.
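
To make the filtering concrete, here is a simplified sketch of the kind of host check that offsite filtering performs. It is not Scrapy's actual implementation: the function name is_offsite is hypothetical, the sample URLs are shortened versions of the ones in this thread, and only the standard library is used.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is not covered by allowed_domains.

    Simplified illustration (not Scrapy's code): a host is allowed when it
    equals an allowed domain or is a subdomain of one.
    """
    host = (urlparse(url).hostname or "").lower()
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# Shortened sample URLs (query strings omitted).
votes_url = "http://www.camara.gov.br/internet/deputado/RelVotacoes.asp"
deputy_url = "http://www.camara.leg.br/internet/Deputado/dep_Detalhe.asp"

print(is_offsite(deputy_url, ["camara.leg.br"]))                  # False
print(is_offsite(votes_url, ["camara.leg.br"]))                   # True
print(is_offsite(votes_url, ["camara.leg.br", "camara.gov.br"]))  # False
```

Under this check the deputy pages pass because www.camara.leg.br is a subdomain of camara.leg.br, while the voting pages are rejected until camara.gov.br is listed.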

Answer 2

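I ran your spider and found out why it never enters parse_votes.

I checked the link passed to yield scrapy.Request(link, callback=self.parse_votes, meta={'vote': vote}) and found that it is not in the same domain.

The link belongs to the camara.gov.br domain, which is not covered by allowed_domains = ["camara.leg.br"].

So you need to add this domain to the allowed_domains list.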

allowed_domains = ["camara.leg.br", "camara.gov.br"]
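
A quick way to spot this kind of mismatch while debugging is to compare the hosts of the extracted links against allowed_domains before yielding any requests. The sketch below assumes a hypothetical helper named missing_domains and uses shortened sample URLs from this thread; it relies only on the standard library and is not part of Scrapy. Scrapy itself normally reports dropped requests at DEBUG level with a "Filtered offsite request" message, so the crawl log is also worth checking.

```python
from urllib.parse import urlparse


def missing_domains(links, allowed_domains):
    """Return the sorted hosts in `links` that no allowed domain covers."""
    missing = set()
    for link in links:
        host = (urlparse(link).hostname or "").lower()
        covered = any(
            host == d or host.endswith("." + d) for d in allowed_domains
        )
        if host and not covered:
            missing.add(host)
    return sorted(missing)


# Shortened sample links (query strings omitted).
links = [
    "http://www.camara.gov.br/internet/deputado/RelVotacoes.asp",
    "http://www2.camara.leg.br/deputados/pesquisa",
]
print(missing_domains(links, ["camara.leg.br"]))
# ['www.camara.gov.br']
```

Any host this prints would need a matching entry in allowed_domains (here, camara.gov.br) for its requests to reach their callback.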

Solution 1
1 2017-08-09 15:08:12

Solution 2
0 Accepted 2017-08-09 15:03:41
